ABSTRACT

The  relationship  between  the  structure  of  a  neural  network  and  its  ability  to 
perform  nonlinear  mapping  is  analyzed.  A  new  algorithm,  called  the  conjugate  gradient 
optimization  method,  for  calculating  the  weights  and  thresholds  of  a  neural  network  is 
presented.  The  performance  of  the  conjugate  gradient  algorithm  is  then  compared  to  the 
well  known  backpropagation  method  and  shown  to  be  more  computationally  efficient. 
A neural network using the conjugate gradient algorithm is then applied to three simple
examples to demonstrate its signal processing capabilities. The first example illustrates
the ability of the neural network to perform classification. The second compares the
performance of a one-step linear predictor to a neural network for a nonlinear chaotic
time series. The neural network predictor is shown to provide much greater accuracy
than its linear counterpart. The final application presented demonstrates the ability of
a neural network to perform channel equalization for a nonminimum phase channel. Its
performance is then compared to its linear equivalent.


I.  INTRODUCTION

Artificial  neural  networks  have  been  studied  for  many  years  in  the  hope  of 
achieving human-like performance. Neural networks consist of highly connected sets
of relatively simple processing elements. Computations are performed collectively by
the entire network with the activity distributed over all the processing elements. This
parallel distributed processing provides neural networks with the potential to solve
complex problems more quickly than current serial methods.
The nonlinear nature and simple structure of neural networks provide a formalism
for the study of nonlinear signal processing.

The application of neural networks to signal processing involves developing an
understanding of the relationship between the structure of a neural network and its
ability to perform the desired input-to-output mapping. A neural network's structure
is defined by the number and type of processing elements in the network, the values
of the weights that connect the processing elements together, and a threshold value
associated with each processing element. Past work has led to a large variety of
neural network models. The models include the Hopfield network, the single- and
multi-layer perceptron networks, the reduced Coulomb energy (RCE) classifier, and
the adaptive resonance theory (ART) model [Ref. 1:pp. 65-73]. Each model differs
in its structure and the manner in which the weights and thresholds of the network
are derived. One current method for calculating the weights and thresholds of a
feedforward multilayer neural network, called the backpropagation method, uses a
steepest descent method to iteratively adapt the weights and thresholds of the network
[Ref. 2:p. 127]. This method has generally been shown to be slow to converge to the
optimal set of weights and thresholds for a given problem [Ref. 1:p. 300]. The
objectives of this thesis research were therefore:

•  Investigate  the  relationship  between  the  structure  of  a  neural  network  and  its 
ability  to  perform  input-output  mapping. 

•  Develop  an  alternative  to  the  backpropagation  method  that  converges  more 
quickly  to  the  optimal  set  of  weights  and  thresholds  for  any  given  problem. 

•  Compare  the  performance  of  a  neural  network  to  its  linear  counterpart  for  some 
representative  signal  processing  applications. 

Chapter  II  provides  a  general  overview  of  the  theory  of  neural  networks.  A 
graphical  approach  is  employed  to  demonstrate  the  ability  of  neural  networks  to 
perform  nonlinear  mapping  for  various  network  configurations.  The  results  are  then 
related  to  a  theorem  by  Kolmogorov.  The  backpropagation  method  for  calculating 
the  weights  and  thresholds  of  the  neural  network  is  also  introduced. 

Chapter III deals with the derivation of an alternative algorithm to the
backpropagation method for calculating the weights and thresholds of a neural network.
The conjugate gradient optimization method is presented and then applied to the
neural network model. The Fibonacci line search method used in conjunction with the
conjugate gradient method is also discussed. The final section of the chapter presents
details concerning the actual implementation of the algorithm, including experimentally
derived parameters.

Chapter IV presents the results of the thesis research. The conjugate gradient
algorithm's performance is compared to the backpropagation method and is shown
to be more computationally efficient. A neural network using the conjugate gradient
algorithm is then applied to three simple examples to validate the performance
of the new algorithm and to demonstrate the types of tasks that a neural network
can perform. The first example illustrates the neural network's ability to perform
classification. A two input neural network is successfully "taught" to differentiate
between sets of points falling inside and outside a circle. The second example compares
the performance of a one-step linear predictor to a neural network for a nonlinear
chaotic time series generated using the Feigenbaum logistic function. This application
demonstrates the nonlinear mapping ability of the neural network. The neural
network predictor is shown to provide much greater accuracy than its linear
counterpart. The final application presented demonstrates the ability of a neural network to
perform channel equalization for a nonminimum phase channel. Its performance is
compared to its linear equivalent and is shown to be superior.

Chapter V contains the overall conclusions of the thesis research and provides
recommendations for future research.


II.  FUNDAMENTALS  -  HOW  NEURAL 
NETWORKS  WORK 

A.     THE  BASIC  BUILDING  BLOCK 

A neural network is a system of relatively simple processing elements whose
function is determined by its network structure, connection weights, and the transfer
function of each neuron. Figure 2.1 shows a single artificial neuron, the fundamental
building block for all neural networks. A set of inputs $x_1, x_2, \ldots, x_n$ is applied
through a set of associated connection weights $w_1, w_2, \ldots, w_n$ to the neuron.


Figure  2.1:   A  single  artificial  neuron 

The inputs correspond to the stimulation levels and the weights to the synaptic
strengths of a biological neuron. The neuron sums the weighted inputs, adds a
threshold value, and applies the result to the neuron's transfer function $f(x)$. This
operation can be expressed as

$$z = f\left(\sum_{i=1}^{n} w_i x_i + \theta\right) \tag{2.1}$$

or in vector notation

$$z = f(\mathbf{w}^T \mathbf{x} + \theta) \tag{2.2}$$

where $z$ is the output of the neuron, $\mathbf{x}$ is a column vector of inputs, $\mathbf{w}$ the
corresponding column vector of weights, and $\theta$ the neuron's threshold value.
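As an illustration of Equations 2.1 and 2.2, the following Python sketch computes the
output of a single artificial neuron. The function name `neuron_output`, the sample
inputs, and the two stand-in transfer functions are assumptions made for this example
and are not taken from the thesis.

```python
def neuron_output(x, w, theta, f):
    """Return f(w^T x + theta) for one artificial neuron."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same number of elements")
    # Weighted sum of the inputs plus the threshold value.
    net = sum(wi * xi for wi, xi in zip(w, x)) + theta
    return f(net)


def identity(v):
    """Stand-in transfer function that passes the weighted sum through."""
    return v


def hard_limit(v):
    """Stand-in two-valued transfer function: 1 if v >= 0, else 0."""
    return 1 if v >= 0 else 0


# Hypothetical inputs, weights, and threshold chosen only for illustration.
x = [0.5, -1.0, 2.0]
w = [1.0, 0.5, -0.25]
theta = 0.25

linear_out = neuron_output(x, w, theta, identity)
limited_out = neuron_output(x, w, theta, hard_limit)
```

Passing the transfer function as an argument keeps the weighted-sum step separate from
the choice of $f$, which is the subject of the next section.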

B.      THE  TRANSFER  FUNCTION 

A  number  of  possibilities  arise  for  selection  of  an  appropriate  transfer  function. 
These  include  most  notably:  the  signum  function,  the  linear  function,  and  the  sigmoid 
function. Initial research conducted in the 1950s and 1960s by Rosenblatt, Minsky,
and others used the signum function shown in Figure 2.2 [Ref. 3]. The signum function
will be used for a preliminary discussion of how neural networks operate.




Figure  2.2:  Signum  function 


Artificial neurons using the signum transfer function were referred to as
perceptrons [Ref. 3]. The signum transfer function causes the output of the perceptron to
take one of two discrete values. The point at which the neuron switches from low to
high or high to low is determined by the input weights and the perceptron's threshold
value. It has been shown that a single perceptron has the ability to distinguish
between two classes of inputs [Ref. 4:p. 13]. This is demonstrated in Figure 2.3 for a
two input network.


The combination of weights ($w_1$ and $w_2$) and the offset ($\theta$) define a line where the
output of the network ($z$) is high for the class of inputs falling on one side of the line
and low for the second class of inputs falling on the other side. If there are $n$ inputs
to a single perceptron, as pictured in Figure 2.1, the perceptron can construct a
hyperplane in the $n$-dimensional input space separating the two classes of inputs. Input
classes that cannot be separated by a simple hyperplane therefore cannot be accurately
differentiated by a single perceptron.

This problem can be remedied by cascading the perceptrons into several layers.
This type of network topology is called a feedforward network because the output
from the previous layer is fed forward to only the neurons in the next layer of the
network. By adding additional layers, more complex boundaries can be defined. A
two layer network is capable of defining decision regions that are convex or concave
in shape. For the two input case shown in Figure 2.4, each perceptron in the first
layer defines a boundary line. A single second layer perceptron weights and combines
the outputs from the first layer perceptrons to produce the two decision regions. As
pictured in Figure 2.4, a two layer network can also define a single enclosed region.
With the addition of a third layer, disjoint enclosed regions can be combined to create
a decision map of any arbitrary complexity, given a sufficient number of perceptrons
in each layer. This is illustrated in Figure 2.5.
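The way a two layer perceptron network encloses a region can be sketched in a few
lines of Python. This is a minimal example under stated assumptions: the signum
function is taken to return +1 (high) or -1 (low), and the weights and thresholds are
hypothetical values chosen so that the enclosed region is the square with corners at
(-1, -1) and (1, 1). None of these values come from the thesis.

```python
def signum(v):
    """Assumed two-valued transfer function: +1 (high) or -1 (low)."""
    return 1 if v >= 0 else -1


def perceptron(x, w, theta):
    """Weighted sum of the inputs plus threshold, passed through signum."""
    return signum(sum(wi * xi for wi, xi in zip(w, x)) + theta)


# First layer: each perceptron defines one boundary line in the (x1, x2) plane.
FIRST_LAYER = [
    ((-1.0, 0.0), 1.0),   # high where x1 <= 1
    ((1.0, 0.0), 1.0),    # high where x1 >= -1
    ((0.0, -1.0), 1.0),   # high where x2 <= 1
    ((0.0, 1.0), 1.0),    # high where x2 >= -1
]

# Second layer: high only when all four first layer outputs are high
# (their sum is 4, and any single low output drops the sum to 2 or less).
SECOND_LAYER = ((1.0, 1.0, 1.0, 1.0), -3.5)


def two_layer_network(x):
    hidden = [perceptron(x, w, theta) for w, theta in FIRST_LAYER]
    w2, theta2 = SECOND_LAYER
    return perceptron(hidden, w2, theta2)


inside = two_layer_network((0.0, 0.0))
outside = two_layer_network((2.0, 0.0))
```

Here `inside` evaluates to the high value and `outside` to the low value, since the
second point lies beyond the boundary line x1 = 1.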

The performance of a multilayer perceptron network using the signum transfer
function is satisfactory provided the desired output from the network is limited to two
discrete values (i.e., high or low). This would be appropriate for a binary classifier
system, where each output would represent one of two classes, i.e., a binary value.
It does not, however, provide sufficient resolution for analog (continuously valued) or
the corresponding discrete valued output functions associated with most other signal
processing applications.




Figure 2.3:  Single neuron and associated decision regions


Figure 2.4:  Two layer network and associated decision regions


Figure  2.5:  Three  layer  network  and  associated  decision  regions 


One example of a transfer function that would be capable of providing such a
continuously variable output is the linear transfer function. In this case, the output of
the artificial neuron would simply be the weighted sum of the inputs plus the neuron's
threshold value. This can be expressed as

$$z = f(\mathbf{x}) = \sum_{i=1}^{n} w_i x_i + \theta \tag{2.3}$$

or in vector notation

$$z = f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + \theta. \tag{2.4}$$

This is the transfer function used by Widrow and Hoff in their development of the
adaptive linear (adaline) and multiple adaptive linear (madaline) filters [Ref. 5:p.
10]. A great deal has been written concerning research and applications of the
adaptive linear filter although it has not often been referred to as a neural model
[Ref. 6], [Ref. 7], [Ref. 8]. One key feature of the linear neural network is that there is
no functional difference between a multilayer and a single layer network. For example,
for the simple two layer network in Figure 2.6 the output of the first layer neurons
can be written as

$$f_1(x_1, x_2) = w_1 x_1 + w_2 x_2 + \theta_1 \tag{2.5}$$

and

$$f_2(x_1, x_2) = w_3 x_1 + w_4 x_2 + \theta_2. \tag{2.6}$$

The output of the network can then be written as

$$f_3(x_1, x_2) = w_5 f_1(x_1, x_2) + w_6 f_2(x_1, x_2) + \theta_3. \tag{2.7}$$

After some algebraic manipulation and substitution, the final result is

$$f_3(x_1, x_2) = (w_5 w_1 + w_6 w_3) x_1 + (w_5 w_2 + w_6 w_4) x_2 + (w_5 \theta_1 + w_6 \theta_2 + \theta_3). \tag{2.8}$$

From the above discussion, it is clear that, regardless of the number of layers in the
network, a linear network can always be reduced to a single layer network. Essentially
then, the linear adaptive filter is nothing more than the linear version of a single layer
neural network.
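The reduction in Equations 2.5 through 2.8 can be checked numerically. The sketch
below evaluates the two layer linear network directly and through its collapsed single
layer form. The weight and threshold values, the test points, and the function names
are hypothetical and serve only to illustrate the algebra.

```python
def two_layer_linear(x1, x2, w, theta):
    """Evaluate the two layer linear network of Equations 2.5 to 2.7."""
    w1, w2, w3, w4, w5, w6 = w
    th1, th2, th3 = theta
    f1 = w1 * x1 + w2 * x2 + th1      # first layer neuron 1
    f2 = w3 * x1 + w4 * x2 + th2      # first layer neuron 2
    return w5 * f1 + w6 * f2 + th3    # second layer neuron


def collapse(w, theta):
    """Return the equivalent single layer weights and threshold (Equation 2.8)."""
    w1, w2, w3, w4, w5, w6 = w
    th1, th2, th3 = theta
    return (w5 * w1 + w6 * w3, w5 * w2 + w6 * w4, w5 * th1 + w6 * th2 + th3)


def single_layer_linear(x1, x2, a1, a2, b):
    """Single linear neuron with weights a1, a2 and threshold b."""
    return a1 * x1 + a2 * x2 + b


# Hypothetical weights and thresholds chosen only for illustration.
weights = (0.5, -1.0, 2.0, 0.25, 1.5, -0.75)
thresholds = (0.1, -0.2, 0.3)
a1, a2, b = collapse(weights, thresholds)

points = [(0.0, 0.0), (1.0, -2.0), (-3.5, 0.25), (10.0, 7.0)]
max_difference = max(
    abs(two_layer_linear(p, q, weights, thresholds) - single_layer_linear(p, q, a1, a2, b))
    for p, q in points
)
```

The two forms agree to within floating point rounding at every test point, so
`max_difference` is negligibly small.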

A third transfer function which has been recently popularized by Rumelhart et
al. [Ref. 9] is called the sigmoid function. It is defined by the equation

$$f(z) = \frac{1}{1 + e^{-z}} \tag{2.9}$$

where

$$z = \sum_{i=1}^{n} w_i x_i + \theta = \mathbf{w}^T \mathbf{x} + \theta. \tag{2.10}$$

The sigmoid function, pictured in Figure 2.7, has a shape which would appear to fall
somewhere between the linear transfer function and the signum transfer function.
Its output is limited to a continuous range of values between zero and one. For values
of $z$ near zero, the transfer function behaves in an approximately linear fashion with a
nearly constant slope (one quarter for the function in Equation 2.9). If the input weights
to the neuron are kept sufficiently small and the range of input values limited, the
sigmoidal artificial neuron can be made to appear linear. Likewise, by using large values
for the input weights $\mathbf{w}$, the values for $z$ would vary more rapidly and the sigmoidal
artificial neuron would more closely approximate the signum function. As a result, the
output of the network can be made to approximate both linear and nonlinear combinations
of the inputs depending on the values of the network's weights ($\mathbf{w}$) and thresholds ($\theta$).
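The two limiting behaviors of the sigmoidal neuron can be illustrated numerically. The
sketch below assumes the sigmoid $f(z) = 1/(1 + e^{-z})$ and a single-input neuron. The
weight values 0.1 and 50 and the input range are hypothetical choices for this example.
For this sigmoid, the slope at $z = 0$ evaluates to one quarter.

```python
import math


def sigmoid(z):
    """Sigmoid transfer function, assumed to be 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_neuron(x, w, theta=0.0):
    """Single-input sigmoidal neuron: f(w * x + theta)."""
    return sigmoid(w * x + theta)


# Slope of the sigmoid at z = 0, estimated with a central difference.
step = 1e-6
slope_at_zero = (sigmoid(step) - sigmoid(-step)) / (2.0 * step)

# Small weight: over x in [-1, 1] the neuron stays close to the straight
# line 0.5 + slope_at_zero * w * x.
small_w = 0.1
xs = [i / 10.0 for i in range(-10, 11)]
max_linear_gap = max(
    abs(sigmoid_neuron(x, small_w) - (0.5 + slope_at_zero * small_w * x)) for x in xs
)

# Large weight: the output saturates near 0 or 1 away from x = 0,
# resembling a two-valued function.
large_w = 50.0
low_side = sigmoid_neuron(-0.2, large_w)
high_side = sigmoid_neuron(0.2, large_w)
```

With the small weight the gap from a straight line stays tiny across the input range,
while with the large weight the outputs on either side of zero sit close to the two
extremes.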

A theorem developed by Kolmogorov and described in Reference 10 provides
further insight into the potential capabilities of a multilayer sigmoidal neural network.
The theorem states that any continuous function of $n$ variables can be represented
using only linear summations and nonlinear but continuously increasing functions of
only one variable. This would indicate that a three layer feedforward network of
artificial neurons using a sigmoidal transfer function is capable of representing any
nonlinear multivariable function. The theorem, however, does not indicate the number of
neurons required in each layer, or how the values for the weights should be derived.

It has been suggested that one approach to representing an $n$-dimensional
nonlinear function using neural networks might be by a weighted combination of
$n$-dimensional 'bumps' [Ref. 11]. This is somewhat analogous to the Fourier series
representation of an arbitrary signal where weighted combinations of sinusoids of
suitable frequencies are used. To see how a nonlinear function might be represented
using a sigmoidal neural network, let us look at the case where we have a nonlinear
function of two variables. The output of the nonlinear function could be interpreted
as a two dimensional surface in a three dimensional space. The output of a single
sigmoidal neuron would have a surface like that pictured in Figure 2.8.


Figure 2.6:  Two layer linear network


Figure 2.7:  Neuron transfer functions (signum, sigmoid, linear)



Figure  2.8:   A  sigmoid  surface 


The orientation of the rising slope of the sigmoidal surface is determined by the
neuron's input weights ($\mathbf{w}$). Its position is determined by its threshold value ($\theta$).
The height of the surface is controlled by the weight connected to the output
neuron. If we add a second neuron with the same orientation, but a slightly different
position than the first by using a different threshold value ($\theta$), and use an output
weight equal to but opposite in sign of the first, we can form a ridge as shown in
Figure 2.9.


Figure  2.9:   A  ridge 


A second ridge, perpendicular to the first, can then be constructed by adding
two additional neurons to the first layer and selecting appropriate input weight values.
The sum of the two ridges then forms the surface pictured in Figure 2.10. The weights
connecting the outputs of the first layer neurons to the single second layer neuron
along with the second layer neuron's threshold value can then be adjusted to yield a
true bump shown in Figure 2.11.
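The construction of a ridge, a pseudo-bump, and a bump from sigmoidal neurons can be
sketched as follows. This is a minimal example, not the network used in the thesis: the
constants `STEEPNESS`, `HALF_WIDTH`, and `GAIN` are hypothetical weight and threshold
choices that place a bump of unit half-width at the origin.

```python
import math


def sigmoid(z):
    """Numerically safe form of the sigmoid 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


STEEPNESS = 10.0   # assumed first layer input weight magnitude
HALF_WIDTH = 1.0   # assumed half-width of each ridge
GAIN = 20.0        # assumed second layer weight magnitude


def ridge(u):
    """Two first layer neurons with equal orientation, different thresholds,
    and output weights of +1 and -1."""
    rising = sigmoid(STEEPNESS * u + STEEPNESS * HALF_WIDTH)
    shifted = sigmoid(STEEPNESS * u - STEEPNESS * HALF_WIDTH)
    return rising - shifted


def pseudo_bump(x1, x2):
    """Sum of two perpendicular ridges: about 2 at the crossing, 1 along a ridge."""
    return ridge(x1) + ridge(x2)


def bump(x1, x2):
    """Second layer neuron whose threshold passes only the ridge crossing."""
    return sigmoid(GAIN * pseudo_bump(x1, x2) - 1.5 * GAIN)


center = bump(0.0, 0.0)       # at the crossing of the two ridges
along_ridge = bump(5.0, 0.0)  # on one ridge, away from the crossing
far_away = bump(5.0, 5.0)     # off both ridges
```

The second layer threshold is set between the height of a single ridge and the height
of the crossing, so only the crossing drives the output neuron high.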

We can now represent any surface as a combination of these bumps. The network
topology to accomplish this would consist of multiple copies of the two layer network and
a single third layer neuron to weight and sum the bumps. The resulting surface is
pictured in Figure 2.12. The preceding development provides some insight into the
number of neurons required in each layer of a neural network to adequately represent
a given nonlinear function. A given function might be more efficiently represented
using a combination of sigmoidal surfaces or ridges rather than bumps. The better
the knowledge one has of the function to be represented, the better the decision
concerning the neural network topology required.


Figure 2.10:  A pseudo-bump


Figure 2.11:  A bump


Figure 2.12:  Multiple bumps

C.      CALCULATION  OF  WEIGHTS  AND  THRESHOLDS 

The burning question that has yet to be addressed concerning the feedforward
sigmoidal neural network is how do we calculate the weights ($\mathbf{w}$) and the neuron
thresholds ($\theta$) of the network to yield a satisfactory representation of a given nonlinear
function. A method called backpropagation, developed by Rumelhart, has proven
popular and has been demonstrated to work fairly well [Ref. 2]. The backpropagation
method uses a training data set consisting of sets of inputs, each paired with a desired
output value. A set of inputs is applied to the neural network and the resulting network
output is compared to the desired value. The error between the neural network's output
and the desired output, along with the current state of the neural network, is used to
modify the neural network's weights and threshold values. The state of the neural network
is defined by the current input to the network, its weights, thresholds, and each neuron's
transfer function. The backpropagation method attempts to minimize the sum of the
squared errors over the entire training data set. This can be expressed as

$$E = \sum_{i} e^2(i) = \sum_{i} \left[ y(i) - z(i) \right]^2 \tag{2.11}$$

where $E$ is the total squared error, $e(i)$ is the network output error for the $i^{th}$ input
set, $y(i)$ is the desired or target output for the $i^{th}$ input set, and $z(i)$ is the actual
output of the neural net for the $i^{th}$ input set. The weights and the thresholds of the
network are iteratively updated in proportion to the gradient of the total squared
error, $E$. This can be expressed as

$$w(n+1) = w(n) + \Delta w(n) = w(n) - \epsilon \frac{\partial E}{\partial w(n)} \tag{2.12}$$

and

$$\theta(n+1) = \theta(n) + \Delta\theta(n) = \theta(n) - \epsilon \frac{\partial E}{\partial \theta(n)} \tag{2.13}$$

where $w(n)$ and $\theta(n)$ are the weights and thresholds at the $n^{th}$ iteration of the
algorithm, $\Delta w(n)$ and $\Delta\theta(n)$ are the incremental changes to the weights and thresholds,
and $\epsilon$ is the proportionality constant [Ref. 2:p. 130]. The backpropagation method
gets its name from the fact that the error at the output of the network is propagated
back through the network in the form of gradients in order to update the network's
weights and thresholds.
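The update rule can be made concrete for the simplest case, a single sigmoidal neuron,
where no error needs to be propagated through hidden layers. The sketch below is an
illustration under stated assumptions rather than Rumelhart's implementation: the
training set (the logical OR of two inputs), the proportionality constant of 0.5, the
iteration count, and all function names are hypothetical. Weights and the threshold are
updated once per complete pass over the training set.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def output(x, w, theta):
    """Actual output z of one sigmoidal neuron."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + theta)


def total_squared_error(samples, w, theta):
    """E: sum over the training set of (desired - actual)^2."""
    return sum((y - output(x, w, theta)) ** 2 for x, y in samples)


def gradient(samples, w, theta):
    """Partial derivatives of E with respect to each weight and the threshold."""
    grad_w = [0.0] * len(w)
    grad_theta = 0.0
    for x, y in samples:
        z = output(x, w, theta)
        # Chain rule: dE/d(net input) = -2 (y - z) * z (1 - z) for the sigmoid.
        delta = -2.0 * (y - z) * z * (1.0 - z)
        for j, xj in enumerate(x):
            grad_w[j] += delta * xj
        grad_theta += delta
    return grad_w, grad_theta


def train(samples, w, theta, epsilon, iterations):
    """Steepest descent with one update per pass over the training set."""
    for _ in range(iterations):
        grad_w, grad_theta = gradient(samples, w, theta)
        w = [wj - epsilon * gj for wj, gj in zip(w, grad_w)]
        theta = theta - epsilon * grad_theta
    return w, theta


# Hypothetical training set: logical OR of two inputs.
samples = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]
w0, theta0 = [0.0, 0.0], 0.0
initial_error = total_squared_error(samples, w0, theta0)
w_fit, theta_fit = train(samples, w0, theta0, epsilon=0.5, iterations=2000)
final_error = total_squared_error(samples, w_fit, theta_fit)
```

Comparing `final_error` with `initial_error` shows the total squared error falling as
the iterations proceed, at the cost of a large number of passes over the data.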

The backpropagation method is essentially a steepest descent optimization
algorithm which uses the gradient of the squared error function to modify the weights
and thresholds of the neural network [Ref. 2:p. 127]. One requirement dictated by
this gradient method is that the transfer function of the neurons be continuously
differentiable [Ref. 2:p. 131]. As a result, this method cannot be used with the signum
transfer function because of its discontinuity. The method, however, does work for
the linear and sigmoidal transfer function cases.

As presented above, the weights and thresholds are updated after a complete
pass of the entire training data set through the network. In the actual
implementation of the algorithm, however, Rumelhart updates the weights and thresholds of
the network after each input/desired output pair is applied [Ref. 2:pp. 136-137].
His rationale for doing this is that the algorithm converges so slowly that it does
not affect the overall convergence rate, and that it is more gratifying to update the
weights and thresholds more frequently [Ref. 2:p. 137]. As Rumelhart indicated,
the steepest descent method is extremely slow to converge. It was this deficiency that
led to the development of this thesis project. Lapedes and Farber indicated that a
related optimization method, the conjugate gradient algorithm, yielded a significant
improvement in the convergence rate of the backpropagation method [Ref. 12]. The
following chapter will address the development and application of this optimization
method to a feedforward sigmoidal neural network.


III.  DERIVATION  OF  THE  ADAPTATION 
ALGORITHM 

A.     THE  CONJUGATE  GRADIENT  METHOD 
1.      General  Description 

The conjugate gradient method is an iterative method for optimizing a
set of coefficients $\mathbf{h}$ in order to minimize a given objective function $J(\mathbf{h})$. It falls
into the class of optimization methods that apply a multidimensional search using
derivatives to the optimization problem [Ref. 13:pp. 289-316]. The steepest descent
method, which Rumelhart uses for adapting the feedforward neural network, is also
a member of this class [Ref. 2]. This class of optimization methods, called gradient
methods, treats the objective function $J(\mathbf{h})$ as a multidimensional surface over which
it iteratively searches for the absolute or global minimum [Ref. 13:pp. 289-316]. The
coefficients $\mathbf{h}$ are the multidimensional coordinates which define where the algorithm
is located on the surface during any particular iteration. This class of optimization
methods requires that the objective function be differentiable with respect to the
coefficients $\mathbf{h}$ that are adapted [Ref. 13:p. 289]. This partial derivative is called the
gradient $\mathbf{g}$ of the objective function. When evaluated for a given set of coefficients $\mathbf{h}$,
the gradient $\mathbf{g}$ is a multidimensional vector which is tangent to the objective function
surface at a point defined by the coefficients $\mathbf{h}$. This vector points in the direction of
greatest increase. The negative of the gradient ($-\mathbf{g}$) logically points downhill in the
direction of greatest decrease. Thus, the gradient vector $\mathbf{g}$ can provide a direction
along the surface of the objective function in which to search for the global minimum.
The advantage of gradient methods is that they decompose the optimization problem
from a multidimensional search of the objective function surface to a sequence of line
searches along directions determined by the gradient vector $\mathbf{g}$.

The method of steepest descent uses the gradient vector $\mathbf{g}$ directly to
perform its iterative line search of the objective function surface [Ref. 14:pp. 214-220].
Rumelhart points out that the steepest descent method works well when the objective
function surface is quadratic or bowl-shaped with a single global minimum [Ref. 2:p.
132]. He states, however, that the more complex objective function surfaces
associated with multilayer neural networks frequently contain many local minima [Ref. 2:p.
132]. As a result, the steepest descent method can become trapped in one of these
local minima yielding a less than optimal solution. This is because the magnitude of
the gradient vector decreases as the algorithm approaches a local minimum. The distance
the steepest descent algorithm travels for a given iteration is a function of a constant
times the magnitude of the gradient. Therefore, as the magnitude of the gradient
decreases, the distance the algorithm travels along the surface decreases. Compounding
the problem is the fact that each successive gradient is orthogonal to the previous
gradient. This causes the algorithm to zigzag in ever smaller steps as it approaches
the bottom of a local minimum. The result is that the algorithm becomes trapped
at the bottom of a local minimum and never reaches the optimal point or global
minimum. Use of a constant stepsize also causes the steepest descent algorithm to be
extremely slow to converge [Ref. 13:pp. 290-291].
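The zigzag behavior can be reproduced on a small example. The sketch below assumes a
two-coefficient quadratic objective and an exact line search along each negative
gradient, which is the setting in which successive gradients are exactly orthogonal. It
does not use the constant stepsize discussed above. The objective, the starting point,
and the iteration count are hypothetical.

```python
DIAGONAL = (1.0, 10.0)  # assumed objective: J(h) = 0.5 * (1*h0^2 + 10*h1^2)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def gradient(h):
    """Gradient of the assumed quadratic objective."""
    return [a * hi for a, hi in zip(DIAGONAL, h)]


def steepest_descent(h, iterations):
    """Follow -g with an exact line search; return the final point and all gradients."""
    gradients = []
    for _ in range(iterations):
        g = gradient(h)
        gradients.append(g)
        curvature = dot(g, [a * gi for a, gi in zip(DIAGONAL, g)])
        if curvature == 0.0:
            break
        alpha = dot(g, g) / curvature   # exact minimizer along -g for a quadratic
        h = [hi - alpha * gi for hi, gi in zip(h, g)]
    return h, gradients


h_final, gradients = steepest_descent([10.0, 1.0], iterations=20)
successive_dots = [dot(gradients[k], gradients[k + 1]) for k in range(len(gradients) - 1)]
norms = [dot(g, g) ** 0.5 for g in gradients]
```

Each entry of `successive_dots` is zero to within rounding, and `norms` shrinks by a
roughly constant factor per iteration, so the search direction alternates between two
orientations while the steps get steadily smaller.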

The conjugate gradient approach is motivated by a desire to accelerate
the convergence rate of the steepest descent method without greatly increasing the
complexity of the algorithm. The conjugate gradient method uses a succession of
direction vectors $\mathbf{d}_k$ that are conjugate to the gradient vector $\mathbf{g}_k$ obtained as the
algorithm progresses. The direction along which the algorithm searches, $\mathbf{d}_k$, is a
linear combination of present and past values of the gradient vector. The result is
that the gradient vector $\mathbf{g}_k$ is orthogonal to the subspace $\Gamma_k$ which is defined by
the set of all previous direction vectors $\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_{k-1}$. Each successive iteration
essentially adds an additional dimension to the subspace $\Gamma_k$. The distance $\alpha_k$ that
the algorithm travels along the line search direction $\mathbf{d}_k$ also varies for each iteration
of the algorithm. This makes the method only slightly more complicated than the
steepest descent method. The algorithm, however, does not become trapped in local
minima as easily as the steepest descent method and converges steadily to the global
minimum or optimal set of coefficients $\mathbf{h}_k$ [Ref. 13:pp. 297-316].

2.  Notation  Summary 

The notation used to describe the conjugate gradient method is as follows:

$J(\mathbf{h})$  Objective function to be minimized.

$\mathbf{h}_k$  Coefficient vector at the $k^{th}$ iteration.

$\mathbf{g}_k$  Gradient vector of the objective function at the $k^{th}$ iteration.

$\mathbf{d}_k$  Search direction vector at the $k^{th}$ iteration.

$\alpha_k$  Search distance coefficient at the $k^{th}$ iteration.

$\beta_k$  Deflection coefficient at the $k^{th}$ iteration.

3.  Summary  of  the  Conjugate  Gradient  Algorithm 

A summary of the conjugate gradient method for minimizing a differentiable
objective function $J(\mathbf{h})$ is listed below [Ref. 13:p. 306]:

Step 1.        Choose an initial set of coefficients $\mathbf{h}_0$.

Step 2.        Calculate the initial gradient $\mathbf{g}_0$ using the definition

$$\mathbf{g}_k = \left. \frac{\partial J(\mathbf{h})}{\partial \mathbf{h}} \right|_{\mathbf{h} = \mathbf{h}_k}. \tag{3.1}$$

Step 3.        Let the initial direction vector be $\mathbf{d}_0 = -\mathbf{g}_0$.

Step 4.        Let $k = 0$.

Step 5.        Let $\alpha_k$ be the optimal solution to the problem to minimize
$J(\mathbf{h}_k + \alpha_k \mathbf{d}_k)$ subject to $\alpha_k \geq 0$.

Step 6.        Update the new coefficients $\mathbf{h}_{k+1}$ using the equation

$$\mathbf{h}_{k+1} = \mathbf{h}_k + \alpha_k \mathbf{d}_k. \tag{3.2}$$

Step 7.        Calculate the next gradient vector value $\mathbf{g}_{k+1}$ using the new
coefficients $\mathbf{h}_{k+1}$.

Step 8.        Calculate the deflection coefficient $\beta_k$ using the equation

$$\beta_k = \frac{(\mathbf{g}_{k+1} - \mathbf{g}_k)^T \mathbf{g}_{k+1}}{\mathbf{g}_k^T \mathbf{g}_k}. \tag{3.3}$$

Step 9.        Update the direction vector $\mathbf{d}_{k+1}$ using the equation

$$\mathbf{d}_{k+1} = -\mathbf{g}_{k+1} + \beta_k \mathbf{d}_k. \tag{3.4}$$

Step 10.        Replace $k$ by $k + 1$ and go to step 5.
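The ten steps above can be sketched in Python as follows. This is an illustration under
stated assumptions, not the thesis implementation: a golden section search stands in for
the line search of step 5, the upper search limit of 2.0 and the stopping test on the
gradient are additions made for this example, and the quadratic objective `J` is a
hypothetical test problem with its minimum at (1, 0.1).

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def golden_section(phi, upper, tol=1e-10):
    """Stand-in line search: minimize a unimodal phi(alpha) on [0, upper]."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = 0.0, upper
    c, d = b - r * (b - a), a + r * (b - a)
    fc, fd = phi(c), phi(d)
    while b - a > tol:
        if fc < fd:                       # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - r * (b - a)
            fc = phi(c)
        else:                             # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + r * (b - a)
            fd = phi(d)
    return 0.5 * (a + b)


def conjugate_gradient(J, grad, h0, max_iterations=50, upper=2.0, tol=1e-10):
    h = list(h0)                                            # Step 1
    g = grad(h)                                             # Step 2
    d = [-gi for gi in g]                                   # Step 3
    for _ in range(max_iterations):                         # Steps 4 and 10
        if dot(g, g) < tol:                                 # added stopping test
            break
        alpha = golden_section(                             # Step 5
            lambda a: J([hi + a * di for hi, di in zip(h, d)]), upper)
        h = [hi + alpha * di for hi, di in zip(h, d)]       # Step 6
        g_new = grad(h)                                     # Step 7
        diff = [gn - go for gn, go in zip(g_new, g)]
        beta = dot(diff, g_new) / dot(g, g)                 # Step 8
        d = [-gn + beta * di for gn, di in zip(g_new, d)]   # Step 9
        g = g_new
    return h


# Hypothetical test problem: J(h) = 0.5*(h0^2 + 10*h1^2) - h0 - h1.
def J(h):
    return 0.5 * (h[0] ** 2 + 10.0 * h[1] ** 2) - h[0] - h[1]


def grad_J(h):
    return [h[0] - 1.0, 10.0 * h[1] - 1.0]


h_min = conjugate_gradient(J, grad_J, [5.0, -3.0])
```

For a quadratic objective in $n$ coefficients with exact line searches, the conjugate
gradient method is known to terminate in at most $n$ iterations. With the approximate
line search used here, the result is close to, but not exactly at, the minimum.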

4.      Selection  of  a  Line  Search  Method 

The conjugate gradient method outlined above requires that a search distance
coefficient $\alpha_k$ be found that minimizes the objective function $J(\mathbf{h}_k + \alpha_k \mathbf{d}_k)$
subject to $\alpha_k \geq 0$. This dictates that a line search be performed starting at the point
in multidimensional space defined by the current coefficient vector $\mathbf{h}_k$ and proceeding
along the line defined by the current direction vector $\mathbf{d}_k$, until the minimum value of
the objective function is found. The distance the line search algorithm travels from
the point $\mathbf{h}_k$ to the minimum value of the function is then defined to be the scalar
value $\alpha_k$. A number of methods have been proposed to perform this line search.
These include the uniform search, dichotomous search, the golden section method,
and the Fibonacci method [Ref. 13:pp. 253-264]. There is also a class of line search
methods which use derivatives to assist in finding the minimum value of the objective
function [Ref. 13:pp. 264-269]. This second group of methods was considered for
use with the conjugate gradient method but was subsequently rejected due to the
complexity of calculating and evaluating the required derivatives. The selection of
an appropriate line search method for use in conjunction with the conjugate gradient
method was based primarily on efficiency. All of the methods except for the Fibonacci
search require two evaluations of the objective function during each iteration of the
algorithm. The Fibonacci method, however, requires only a single evaluation because
it also uses the results from the previous iteration. Comparison of the line search
methods mentioned above revealed that the Fibonacci search method is the most
efficient [Ref. 13:p. 264]. As a result, the Fibonacci search method was chosen to be
used in conjunction with the conjugate gradient method.

The Fibonacci method performs a search for the minimum value of a function of a
single variable over a closed bounded interval $[a, b]$. The function in this case is
$J(\mathbf{h}_k + \alpha_{kj} \mathbf{d}_k)$ where $\alpha_{kj}$ is the single variable.
The interval over which the algorithm searches is called the interval of uncertainty and
limits the range of values for $\alpha_{kj}$. The lower limit for $\alpha_{kj}$ is given by
the conjugate gradient method as zero, but the upper limit must be specified before the
algorithm can begin. The interval of uncertainty is steadily reduced as the algorithm
progresses. The number of iterations which the algorithm will perform must also be
specified before the start of the algorithm. The Fibonacci method is based on the
Fibonacci sequence $F_n$, which is defined by

$$F_{n+1} = F_n + F_{n-1} \tag{3.5}$$

$$F_0 = F_1 = 1 \tag{3.6}$$

The resulting sequence is 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, .... The Fibonacci search
method begins by evaluating the objective function at each of two points within the
interval of uncertainty as shown in Figure 3.1.

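These two points, which we will call $\lambda_j$ and $\mu_j$, are calculated using

$$\lambda_j = a_j + \frac{F_{n-j-2}}{F_{n-j}} (b_j - a_j) \tag{3.7}$$

and

$$\mu_j = a_j + \frac{F_{n-j-1}}{F_{n-j}} (b_j - a_j) \tag{3.8}$$
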
where $k$ is the iteration index of the conjugate gradient algorithm, $j$ is the iteration
index of the Fibonacci algorithm, $[a_j, b_j]$ is the current interval of uncertainty, and
$n$ is the total number of iterations planned. A new interval of uncertainty
$[a_{j+1}, b_{j+1}]$ is then selected based on the value of the objective function at the
two points $\lambda_j$ and $\mu_j$. If
$J(\mathbf{h}_k + \lambda_j \mathbf{d}_k) > J(\mathbf{h}_k + \mu_j \mathbf{d}_k)$, then the
new interval of uncertainty $[a_{j+1}, b_{j+1}]$ is given by $[\lambda_j, b_j]$. Likewise,
if the opposite is true,
$J(\mathbf{h}_k + \lambda_j \mathbf{d}_k) < J(\mathbf{h}_k + \mu_j \mathbf{d}_k)$, then the
new interval of uncertainty is $[a_j, \mu_j]$. The two cases are shown in Figures 3.2 and
3.3. The key feature that makes the Fibonacci method so attractive is that, for the next
iteration $j + 1$, either $\lambda_{j+1} = \mu_j$ or $\mu_{j+1} = \lambda_j$, depending on
which new interval of uncertainty was selected. Since the objective function has already
been evaluated at the previous values for $\lambda_j$ and $\mu_j$, only one additional
evaluation must be made for each succeeding iteration. At the completion of the specified
$n$ iterations of the algorithm, the size of the final interval of uncertainty will be

$$b_n - a_n = \frac{b_0 - a_0}{F_n} \tag{3.9}$$


Figure 3.1:   Initial evaluation points $\lambda_0$ and $\mu_0$ and interval of uncertainty

Figure 3.2:   Evaluation points $\lambda_{j+1}$ and $\mu_{j+1}$ and revised interval of
uncertainty when $J(\lambda_j) > J(\mu_j)$

Figure 3.3:   Evaluation points $\lambda_{j+1}$ and $\mu_{j+1}$ and revised interval of
uncertainty when $J(\lambda_j) < J(\mu_j)$

If we select the midpoint of the final interval of uncertainty as the value $\alpha_k$ to
be used by the conjugate gradient method, then we can calculate the number of iterations
$n$ required to achieve a desired accuracy after deciding upon an upper bound $b_0$. The
upper bound and number of iterations used for the neural network problem will be
presented in the next chapter.
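
A Fibonacci search of this kind might be sketched as follows. The function names are
hypothetical, the indexing is one possible convention chosen so that the final interval
has length $(b_0 - a_0)/F_n$, and the small offset `eps` used to split the last interval
is an assumption of this sketch rather than something specified in the text.

```python
from typing import Callable, List, Tuple


def fibonacci_numbers(n: int) -> List[int]:
    """Return [F_0, ..., F_n] with F_0 = F_1 = 1 (equations 3.5 and 3.6)."""
    fib = [1, 1]
    for _ in range(n - 1):
        fib.append(fib[-1] + fib[-2])
    return fib[: n + 1]


def fibonacci_search(
    func: Callable[[float], float], a: float, b: float, n: int, eps: float = 1e-9
) -> Tuple[float, float]:
    """Reduce [a, b] to an interval of length (b - a) / F_n around a minimum."""
    if n < 2:
        raise ValueError("n must be at least 2")
    fib = fibonacci_numbers(n)
    lam = a + fib[n - 2] / fib[n] * (b - a)
    mu = a + fib[n - 1] / fib[n] * (b - a)
    f_lam, f_mu = func(lam), func(mu)
    for j in range(n - 2):
        m = n - j - 1  # the reduced interval has length (b0 - a0) * F_m / F_n
        if f_lam > f_mu:
            a = lam                     # keep [lam_j, b_j]
            lam, f_lam = mu, f_mu       # reuse: lam_{j+1} = mu_j
            mu = a + fib[m - 1] / fib[m] * (b - a)
            f_mu = func(mu)             # the single new evaluation
        else:
            b = mu                      # keep [a_j, mu_j]
            mu, f_mu = lam, f_lam       # reuse: mu_{j+1} = lam_j
            lam = a + fib[m - 2] / fib[m] * (b - a)
            f_lam = func(lam)           # the single new evaluation
    # The two interior points now coincide at the midpoint, so a small offset
    # eps (an assumption of this sketch) decides which half to keep.
    mid = 0.5 * (a + b)
    if func(mid) > func(mid + eps):
        a = mid
    else:
        b = mid
    return a, b


def step_size(func: Callable[[float], float], upper: float, n: int) -> float:
    """Midpoint of the final interval when searching over [0, upper]."""
    a, b = fibonacci_search(func, 0.0, upper, n)
    return 0.5 * (a + b)


low, high = fibonacci_search(lambda x: (x - 2.0) ** 2, 0.0, 10.0, 16)
```

Inside the loop only one of the two interior points is new at each iteration; the other
is carried over together with its function value, which is the property that makes the
method economical.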

5.      Calculation of the Deflection Coefficient $\beta_k$

The equation used to calculate the deflection constant $\beta_k$ (equation 3.3)
is the Polak-Ribiere version of the conjugate gradient method originally proposed by
Fletcher and Reeves [Ref. 14:p. 253]. The original method used the equation

$$\beta_k = \frac{\mathbf{g}_{k+1}^T \mathbf{g}_{k+1}}{\mathbf{g}_k^T \mathbf{g}_k} \tag{3.10}$$

to calculate the deflection constant $\beta_k$. The two equations are equivalent if the
objective function to be minimized is quadratic. Experimental results, however, tend
to indicate that the Polak-Ribiere method is more effective for nonquadratic objective
functions [Ref. 14:p. 254]. This is because the Polak-Ribiere method tends to reset the
direction vector $\mathbf{d}_{k+1}$ to the negative of the gradient vector
$\mathbf{g}_{k+1}$ when two successive gradients $\mathbf{g}_k$ and $\mathbf{g}_{k+1}$ are
equal, since the numerator of equation 3.3 is then zero. This has the effect of beginning
the conjugate gradient method anew, using the present coefficient vector $\mathbf{h}_k$ as
the new initial coefficient vector $\mathbf{h}_0$.
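
The difference between equations 3.3 and 3.10 can be seen with a short numerical sketch.
The function names and gradient values below are hypothetical and serve only to
illustrate the two formulas.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def beta_polak_ribiere(g_new: Vector, g_old: Vector) -> float:
    """Deflection coefficient of equation 3.3."""
    diff = [n - o for n, o in zip(g_new, g_old)]
    return dot(diff, g_new) / dot(g_old, g_old)


def beta_fletcher_reeves(g_new: Vector, g_old: Vector) -> float:
    """Deflection coefficient of equation 3.10."""
    return dot(g_new, g_new) / dot(g_old, g_old)


# Two identical successive gradients: only Polak-Ribiere drops the old direction.
same = [3.0, 4.0]
pr_equal = beta_polak_ribiere(same, same)
fr_equal = beta_fletcher_reeves(same, same)

# Orthogonal successive gradients, as produced by an exact line search on a
# quadratic objective: the two formulas give the same value.
g_old, g_new = [-1.0, -2.0], [0.5, -0.25]
pr_orth = beta_polak_ribiere(g_new, g_old)
fr_orth = beta_fletcher_reeves(g_new, g_old)
```

With identical gradients the Polak-Ribiere coefficient is zero while the Fletcher-Reeves
coefficient is one, so only the former discards the previous direction.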

B.      APPLYING THE CONJUGATE GRADIENT METHOD TO A NEURAL NETWORK

1.      The  Neural  Network  Model  and  Notation 

The generic neural network model to be used for the purposes of discussion
is pictured in Figure 3.4. The notation used when referring to the various variables
of the model is as follows:


Figure  3.4:   Neural  network  model 


$x_{ij}$     The $j$th input to the $i$th layer of the network. For other than the inputs
$x_{01}, x_{02}, \ldots, x_{0l}$, the variable $x_{ij}$ is also the output of the $j$th
neuron in the $(i-1)$th layer and is a function of the previous layer's inputs and weights
and the $j$th neuron's threshold value.

$w_{ijk}$    The weight in the $i$th layer of the network that connects the $j$th input
$x_{ij}$ to the $k$th neuron of the layer.

$\theta_{ik}$    The threshold value associated with the $k$th neuron of the $i$th layer
of neurons.

$y$          The desired output value of the network for a given set of inputs
$x_{01}, x_{02}, \ldots, x_{0l}$.

$f(\cdot)$   The transfer function of the neuron.

2.      The Neural Network Objective Function $J(\mathbf{h})$

As was mentioned in the previous chapter, we wish to minimize the total
sum of the squared errors over an entire training data set. As a result, the objective
function $J(\mathbf{h})$ to be minimized using the conjugate gradient method is

$$E = \frac{1}{2} \sum_t e^2(t) \tag{3.11}$$

where $e(t)$ is the error between the actual and the desired outputs of the neural
network for the $t$th data set.

3.      The  Adaptation  Coefficients  h 

There are two quantities that we wish to adapt in order for the neural
network to consistently produce the desired output for a given input. These two
quantities are the connection weights $w_{ijk}$ of the network and the threshold values
$\theta_{ik}$ associated with each neuron in the network. Together, these two sets of
coefficients form the coefficient vector $\mathbf{h}$. The conjugate gradient algorithm
uses a single vector $\mathbf{h}$ to represent the coefficients which are adapted to
minimize the objective function $J(\mathbf{h})$. The notation used for the neural network
model, however, reflects the use of matrices $[w_{ijk}]$ for the weights and vectors
$[\theta_{ik}]$ for the thresholds. This was done to simplify the identification of the
various weights and thresholds. We must therefore combine and transform the weight
matrices and threshold vectors into a single vector $\mathbf{h}$ in order to apply the
conjugate gradient algorithm. This is done by assigning the individual weights and
thresholds to a vector as shown in equation 3.12.

$$\mathbf{h} = [w_{011}, w_{012}, \ldots, w_{0lm}, w_{111}, \ldots, w_3, \theta_{01}, \theta_{02}, \ldots, \theta_2] \tag{3.12}$$

We can perform the conjugate gradient algorithm using the vector notation and then
perform a reverse transformation at the completion of the algorithm to assign the
final weights and threshold values to the neural network.
4.      The  Gradient  Vector  g 

The gradient vector $\mathbf{g}$ used by the conjugate gradient method is defined as

$$\mathbf{g} = \frac{\partial J}{\partial \mathbf{h}}(\mathbf{h}). \tag{3.13}$$

The gradient vector $\mathbf{g}$ for the neural network problem consists of the gradients
associated with the weights and thresholds of the neural network. The gradient vector
$\mathbf{g}$ would be of the form

$$\mathbf{g} = [g_{011}, g_{012}, \ldots, g_{0lm}, g_{111}, \ldots, g_3, g_{\theta_{01}}, g_{\theta_{02}}, \ldots, g_{\theta_2}]. \tag{3.14}$$

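The gradient for any particular weight or threshold of the network is calculated by
taking the partial derivative of the error function $E$ with respect to the particular
weight ($w_{ijk}$) or threshold ($\theta_{ik}$). For the gradient associated with a weight
this would be expressed as

$$g_{ijk} = \frac{\partial E}{\partial w_{ijk}} = \frac{1}{2} \frac{\partial}{\partial w_{ijk}} \sum_t e^2(t) \tag{3.15}$$

and for the gradient associated with a threshold as

$$g_{\theta_{ik}} = \frac{\partial E}{\partial \theta_{ik}} = \frac{1}{2} \frac{\partial}{\partial \theta_{ik}} \sum_t e^2(t). \tag{3.16}$$

The partial derivative in equations 3.15 and 3.16 can be moved inside the respective
summation terms resulting in the following expressions

$$\frac{\partial E}{\partial w_{ijk}} = \frac{1}{2} \sum_t \frac{\partial}{\partial w_{ijk}} e^2(t) \tag{3.17}$$

and

$$\frac{\partial E}{\partial \theta_{ik}} = \frac{1}{2} \sum_t \frac{\partial}{\partial \theta_{ik}} e^2(t). \tag{3.18}$$

The gradient for each weight $w_{ijk}$ can therefore be expressed as the sum of the partial
gradients

$$g_{ijk} = \sum_t g_{ijk}(t) \tag{3.19}$$

where the partial gradient $g_{ijk}(t)$ is the gradient associated with the weight
$w_{ijk}$ when evaluated for a single set of training data rather than the entire training
data set. The gradients associated with the threshold values of the neural network can be
expressed in a similar manner, given by

$$g_{\theta_{ik}} = \sum_t g_{\theta_{ik}}(t). \tag{3.20}$$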

For the purposes of notational brevity, we will assume that the training data set
consists of only one set of inputs and the associated desired output. This will allow
us to reduce the length of the equations for the gradient by removing references to the
particular element of the training set used. The reader should remember, however,
that if there are $s$ pairs of data in the training set, then the gradient is the sum of
the $s$ partial gradients as expressed in equation 3.19 and equation 3.20.
a.      Neuron  Transfer  Function  Derivative 

Before delving into the derivation of the equations for the gradients
of the weights and thresholds of the neural network, a few comments should be made
concerning the transfer function used for the neural network model and its derivative.
The transfer function to be used is the sigmoid function defined by equation 2.9
in Chapter 2. A key feature of the sigmoidal function is that its derivative can be
expressed in terms of its original value by

$$\frac{d}{dx} f(x) = -f(x)\,(1 - f(x)). \tag{3.21}$$

The derivative of a neuron's output can thus be expressed as a function of the output
of the neuron and the partial derivative of the neuron's inputs. The partial derivative
of the neuron's output with respect to $w_{ijk}$ is then given by

$$\frac{\partial x_{(i+1)k}}{\partial w_{ijk}} = (-x_{(i+1)k})(1 - x_{(i+1)k}) \frac{\partial}{\partial w_{ijk}} \left( \theta_{ik} - \sum_p w_{ipk} x_{ip} \right) \tag{3.22}$$

and

$$\frac{\partial x_{(i+1)k}}{\partial \theta_{ik}} = (-x_{(i+1)k})(1 - x_{(i+1)k}) \frac{\partial}{\partial \theta_{ik}} \left( \theta_{ik} - \sum_p w_{ipk} x_{ip} \right) \tag{3.23}$$

for the derivative with respect to $\theta_{ik}$. Equations 3.22 and 3.23 will be used
frequently to evaluate the partial derivatives of each neuron's output when deriving the
equations for the gradients of the neural network.
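
Equation 3.21 can be checked numerically. Equation 2.9 is not reproduced in this
section, so the sketch below assumes the form $f(x) = 1/(1 + e^{x})$, which is one
sigmoid whose derivative carries the negative sign of equation 3.21; the function names
are hypothetical.

```python
import math


def f(x: float) -> float:
    """Assumed sigmoid f(x) = 1 / (1 + e^x); equation 2.9 may differ in detail."""
    return 1.0 / (1.0 + math.exp(x))


def f_prime(x: float) -> float:
    """Derivative written in terms of the function value, as in equation 3.21."""
    fx = f(x)
    return -fx * (1.0 - fx)


def central_difference(func, x: float, step: float = 1e-6) -> float:
    """Numerical derivative used only to check the closed form."""
    return (func(x + step) - func(x - step)) / (2.0 * step)


check_points = [-2.0, 0.0, 0.5, 3.0]
max_gap = max(abs(f_prime(x) - central_difference(f, x)) for x in check_points)
```

The practical benefit is that the derivative needs no further exponential: once a
neuron's output has been computed in the forward pass, its slope follows from one
multiplication.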

b.      Calculation  of  the  Third  Layer  Gradient 

The calculation of the gradients for each weight and threshold of the
neural network begins at the output of the neural network where the difference
between the actual network output and the desired output produces an error. This
error is propagated back through the network in the form of gradients. The gradient
associated with the output weight $w_3$ can be expressed as

$$g_3 = \frac{\partial E}{\partial w_3} = \frac{1}{2} \frac{\partial}{\partial w_3} e^2 = \frac{1}{2} \frac{\partial}{\partial w_3} (y - w_3 x_3)^2 \tag{3.24}$$

where $w_3 x_3$ is the output of the network and $y$ is the desired output value. Taking
the partial derivative yields

$$g_3 = (y - w_3 x_3) \frac{\partial}{\partial w_3} (y - w_3 x_3) = (y - w_3 x_3)(-x_3). \tag{3.25}$$

After rearranging the terms of equation 3.25, the final form for the output weight's
gradient $g_3$ becomes

$$g_3 = (w_3 x_3 - y)\, x_3. \tag{3.26}$$

c.      Calculation  of  the  Second  Layer  Gradients 

Derivation of the input, first and second layer gradients is somewhat
more involved than that of the third layer gradients because of the multiple neurons
and weights between the error at the output and the gradient for which we are deriving
an expression. The gradient equation for a weight in the second layer can be expressed
as

$$g_{2j} = \frac{\partial E}{\partial w_{2j}} = (y - w_3 x_3) \frac{\partial}{\partial w_{2j}} (y - w_3 x_3). \tag{3.27}$$

Of the terms evaluated by the partial derivative, only the output of the third layer
neuron $x_3$ is affected by a variation of the second layer weight $w_{2j}$. The desired
output $y$ can be eliminated and the partial derivative shifted to the right of the output
weight term $w_3$. This yields the expression

$$g_{2j} = (y - w_3 x_3)(-w_3) \frac{\partial}{\partial w_{2j}} (x_3). \tag{3.28}$$

We can replace the partial derivative term in equation 3.28 with an equivalent
expression that can be evaluated with respect to $w_{2j}$ using equation 3.22. This results
in the following expression

$$g_{2j} = (y - w_3 x_3)(-x_3)(-w_3)(1 - x_3) \frac{\partial}{\partial w_{2j}} \left( \theta_2 - \sum_p w_{2p} x_{2p} \right). \tag{3.29}$$

Comparing the first part of equation 3.29 with equation 3.26, we find that we can
replace the first two terms of equation 3.29 with the output weight's gradient $g_3$.
After taking the partial derivative, only one term, $-x_{2j}$, remains. The equation for
the second layer weight gradient becomes

$$g_{2j} = g_3 (-w_3)(1 - x_3)(-x_{2j}) = g_3 w_3 (1 - x_3)\, x_{2j}. \tag{3.30}$$

We can see from equation 3.30 that the gradient $g_{2j}$ is a function of the weight $w_3$
that connects the neuron's output to the next layer, the gradient $g_3$ that is associated
with the output weight, the neuron's output value $x_3$, and the input $x_{2j}$ that is
applied to the weight for which we are calculating the gradient ($g_{2j}$). This
relationship between the inputs, outputs, weights and gradients will be found to be
consistent for each of the gradients of the neural network.

Rather than starting from scratch to derive the equation for the gradient
associated with the output neuron's threshold $\theta_2$, we begin at the point where
evaluation of the partial derivative with respect to $\theta_2$ differs from that for the
weight gradient $g_{2j}$. The equation for the gradient of the output neuron's threshold
becomes

$$g_{\theta_2} = g_3 (-w_3)(1 - x_3) \frac{\partial}{\partial \theta_2} \left( \theta_2 - \sum_p w_{2p} x_{2p} \right). \tag{3.31}$$

Evaluation of the partial derivative yields a constant of one since none of the
summation terms is a function of the threshold value $\theta_2$. Shifting the sign term,
the final form for the gradient is

$$g_{\theta_2} = -g_3 w_3 (1 - x_3). \tag{3.32}$$

Note that the equation for the gradient of the neuron's threshold value $\theta_2$
(equation 3.32) has the same form as that for the input weights $w_{2j}$ connected to the
output neuron (equation 3.29) except for the input term $x_{2j}$. We can treat the
threshold value as a weight if we assume that the threshold "weight" has a constant input
of $-1$.
d.      Calculation  of  the  First  Layer  Gradients 

The derivation of the equation for the gradient of the first layer weights
follows in a similar fashion to that of the second layer. We begin at the point where
evaluation of the partial derivative differs (equation 3.29). The equation for the first
layer weight gradient becomes

$$g_{1jk} = \frac{\partial E}{\partial w_{1jk}} = g_3 (-w_3)(1 - x_3) \frac{\partial}{\partial w_{1jk}} \left( \theta_2 - \sum_p w_{2p} x_{2p} \right). \tag{3.33}$$

Only the output of the $k$th neuron in the second layer ($x_{2k}$) is affected by the value
of the weight $w_{1jk}$ of the first layer. Therefore all terms except for the $k$th term
of the summation in equation 3.33 are zero when the partial derivative is taken. This
yields the expression

$$g_{1jk} = -g_3 w_3 (1 - x_3)(-w_{2k}) \frac{\partial}{\partial w_{1jk}} x_{2k}. \tag{3.34}$$

Using equation 3.22 we can rewrite equation 3.34 as

$$g_{1jk} = -g_3 w_3 (1 - x_3)(-x_{2k})(-w_{2k})(1 - x_{2k}) \frac{\partial}{\partial w_{1jk}} \left( \theta_{1k} - \sum_p w_{1pk} x_{1p} \right). \tag{3.35}$$

The first part of equation 3.35 can be replaced with the gradient $g_{2k}$ using equation
3.30. Only the $j$th term of the summation under evaluation by the partial derivative
with respect to $w_{1jk}$ is nonzero. The equation for the first layer gradients of the
weights then becomes

$$g_{1jk} = g_{2k} (-w_{2k})(1 - x_{2k})(-x_{1j}) \tag{3.36}$$

which when rearranged yields

$$g_{1jk} = g_{2k} w_{2k} (1 - x_{2k})\, x_{1j}. \tag{3.37}$$

Again, the present layer's gradient is a function of the next layer's gradients and
weights, the present layer's neuron output values, and the input to the present layer.

The derivation of the equation for the gradient associated with the
neuron thresholds of the first layer follows in the same manner as that of the second
layer. The equation for the threshold gradients $g_{\theta_{1k}}$ of the first layer can be
expressed as

$$g_{\theta_{1k}} = g_{2k} (-w_{2k})(1 - x_{2k}) \frac{\partial}{\partial \theta_{1k}} \left( \theta_{1k} - \sum_p w_{1pk} x_{1p} \right). \tag{3.38}$$

Evaluating the partial derivative results in the final equation

$$g_{\theta_{1k}} = -g_{2k} w_{2k} (1 - x_{2k}) \tag{3.39}$$

which has the same form as equation 3.32.

e.      Calculation  of  the  Input  Layer  Gradients 

Derivation of the input layer's gradient equation differs only slightly
from the previous development. The difference is due to the fact that a variation in
the value of a weight in the input layer affects the output of more than a single neuron
in the next layer of the network. This means that we must retain a summation term
throughout the calculation of the input layer's gradient equation. The gradient for the
input layer weight can be expressed as

$$g_{0jk} = \frac{\partial E}{\partial w_{0jk}} = g_3 (-w_3)(1 - x_3) \frac{\partial}{\partial w_{0jk}} \left( \theta_2 - \sum_p w_{2p} x_{2p} \right). \tag{3.40}$$

The threshold $\theta_2$ is not a function of the input layer's weights and is eliminated
when its partial derivative is taken with respect to the input weight $w_{0jk}$. The other
terms under evaluation by the partial derivative (i.e., $x_{2p}$) are all, however, a
function of the input weight $w_{0jk}$. The partial derivative can be moved inside the
summation resulting in

$$g_{0jk} = g_3 w_3 (1 - x_3) \sum_p w_{2p} \frac{\partial}{\partial w_{0jk}} x_{2p}. \tag{3.41}$$

Shifting the summation to the far left and evaluating the partial derivative using
equation 3.22 yields

$$g_{0jk} = \sum_p g_3 w_3 (1 - x_3)\, w_{2p} (-x_{2p})(1 - x_{2p}) \frac{\partial}{\partial w_{0jk}} \left( \theta_{1p} - \sum_q w_{1qp} x_{1q} \right). \tag{3.42}$$

The $\theta_{1p}$ term in equation 3.42 can be eliminated since it is not a function of
$w_{0jk}$. The remaining terms can then be rearranged to produce

$$g_{0jk} = \sum_p g_3 w_3 (1 - x_3)(x_{2p})\, w_{2p} (1 - x_{2p}) \frac{\partial}{\partial w_{0jk}} \sum_q w_{1qp} x_{1q}. \tag{3.43}$$

The value $g_{2p}$ can be substituted for the first part of equation 3.43 using equation
3.30. Also, only the output of the $k$th neuron of the input layer, $x_{1k}$, is a function
of the input weight $w_{0jk}$. As a result, evaluating the partial derivative using
equation 3.22 results in the equation

$$g_{0jk} = \sum_p g_{2p} w_{2p} (1 - x_{2p})\, w_{1kp} (-x_{1k})(1 - x_{1k}) \frac{\partial}{\partial w_{0jk}} \left( \theta_{0k} - \sum_r w_{0rk} x_{0r} \right). \tag{3.44}$$

Evaluating the partial derivative in equation 3.44 with respect to $w_{0jk}$ we find that
only the $j$th term of the summation is nonzero. Rearranging the terms yields

$$g_{0jk} = \sum_p g_{2p} w_{2p} (1 - x_{2p})\, x_{1k}\, w_{1kp} (1 - x_{1k})\, x_{0j}. \tag{3.45}$$

Finally, we can replace the first four terms of equation 3.45 with the value $g_{1kp}$
using equation 3.37. This results in the equation for the gradients of the weights of the
input layer of the network

$$g_{0jk} = \left( \sum_p g_{1kp} w_{1kp} \right) (1 - x_{1k})\, x_{0j}. \tag{3.46}$$

Using the same reasoning used to derive equations 3.32 and 3.39 we can express the
equation for the gradient of the input layer neuron thresholds as

$$g_{\theta_{0k}} = \left( \sum_p -g_{1kp} w_{1kp} \right) (1 - x_{1k}). \tag{3.47}$$

Derivation of the equations for the gradients associated with the weights and
thresholds of the neural network is now complete. What we have found is that the gradients
for any particular layer of the network can be expressed as a function of the given
layer's weights, thresholds, inputs, outputs, and the following layer's gradients. It is
not necessary to begin at the output of the network and use the network output error
$e(t)$ to calculate the gradient for a particular weight or threshold which is several
layers back in the network. The above expressions for the gradient do, however, dictate
that the gradient calculations begin at the output of the network and the gradients
be propagated back through the network.
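
The gradient equations derived above can be collected into a short routine and compared
against numerical derivatives. The sketch below is not the program of Appendix B. It
assumes a neuron output of the form $f(\theta - \sum w x)$ with $f(u) = 1/(1 + e^{u})$,
uses a hypothetical dictionary layout for the weights and thresholds, and treats a single
training pair.

```python
import copy
import math


def f(u):
    """Assumed transfer function f(u) = 1 / (1 + e^u), so that f' = -f (1 - f)."""
    return 1.0 / (1.0 + math.exp(u))


def forward(p, x0):
    """Return the layer outputs (x1, x2, x3) for one input vector x0."""
    n_in, n0, n1 = len(x0), len(p["th0"]), len(p["th1"])
    x1 = [f(p["th0"][k] - sum(p["w0"][j][k] * x0[j] for j in range(n_in)))
          for k in range(n0)]
    x2 = [f(p["th1"][k] - sum(p["w1"][j][k] * x1[j] for j in range(n0)))
          for k in range(n1)]
    x3 = f(p["th2"] - sum(p["w2"][q] * x2[q] for q in range(n1)))
    return x1, x2, x3


def error(p, x0, y):
    """E = 0.5 (y - w3 x3)^2 for a single training pair."""
    return 0.5 * (y - p["w3"] * forward(p, x0)[2]) ** 2


def gradients(p, x0, y):
    """Gradients from equations 3.26, 3.30, 3.32, 3.37, 3.39, 3.46 and 3.47."""
    x1, x2, x3 = forward(p, x0)
    n_in, n0, n1 = len(x0), len(x1), len(x2)
    w1, w2, w3 = p["w1"], p["w2"], p["w3"]
    g3 = (w3 * x3 - y) * x3
    g2 = [g3 * w3 * (1 - x3) * x2[j] for j in range(n1)]
    gth2 = -g3 * w3 * (1 - x3)
    g1 = [[g2[k] * w2[k] * (1 - x2[k]) * x1[j] for k in range(n1)]
          for j in range(n0)]
    gth1 = [-g2[k] * w2[k] * (1 - x2[k]) for k in range(n1)]
    # back[k] is the bracketed sum that appears in equations 3.46 and 3.47
    back = [sum(g1[k][q] * w1[k][q] for q in range(n1)) for k in range(n0)]
    g0 = [[back[k] * (1 - x1[k]) * x0[j] for k in range(n0)]
          for j in range(n_in)]
    gth0 = [-back[k] * (1 - x1[k]) for k in range(n0)]
    return {"w0": g0, "w1": g1, "w2": g2, "w3": g3,
            "th0": gth0, "th1": gth1, "th2": gth2}


def numeric_gradient(p, x0, y, key, index=(), step=1e-6):
    """Central-difference estimate of dE/d(parameter), used only as a check."""
    def shifted(delta):
        q = copy.deepcopy(p)
        if len(index) == 0:
            q[key] += delta
        elif len(index) == 1:
            q[key][index[0]] += delta
        else:
            q[key][index[0]][index[1]] += delta
        return error(q, x0, y)
    return (shifted(step) - shifted(-step)) / (2.0 * step)


# Hypothetical network (two inputs, two neurons per hidden layer) and training pair.
net = {"w0": [[0.3, -0.2], [0.5, 0.1]], "w1": [[0.4, -0.6], [0.2, 0.7]],
       "w2": [0.8, -0.5], "w3": 1.5,
       "th0": [0.1, -0.3], "th1": [0.2, 0.05], "th2": -0.1}
x_in, target = [0.6, 0.9], 0.4
analytic = gradients(net, x_in, target)
```

Each layer's gradients are computed from the gradients of the layer that follows it, so
the quantities `g3`, `g2` and `g1` are evaluated once and reused as the calculation moves
back toward the input.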
5.      Fibonacci Line Search Parameters

Several parameters associated with the Fibonacci line search method must
be specified before the conjugate gradient algorithm described in this chapter can be
applied. These parameters are:

•  The  initial  size  of  the  interval  of  uncertainty 

•  The  number  of  iterations  that  the  line  search  should  perform. 

The Fibonacci line search attempts to find the best stepsize ($\alpha_k$) in which to step
along the error function surface towards the global minimum in a direction defined
by the direction vector ($\mathbf{d}_k$). The initial interval of uncertainty is the
interval over which the algorithm will search for the optimal stepsize ($\alpha_k$). The
initial interval, therefore, establishes the minimum and maximum stepsize values. Our goal
is to find the optimal set of weights and thresholds by moving steadily down the error
function surface towards the global minimum. The lower bound of the interval, or minimum
stepsize value, is therefore zero since a negative value would move the algorithm up the
error function surface in a direction opposite the direction vector ($\mathbf{d}_k$).
Selection of an upper bound for the interval entails a number of tradeoffs. A larger
maximum value would allow the algorithm to search over a greater interval for the optimal
stepsize ($\alpha_k$). This could allow the conjugate gradient algorithm to converge to
the global minimum more quickly by enabling it to step farther down the error function
surface at each iteration of the algorithm. It could also possibly provide more protection
against being trapped in a local minimum by allowing the line search algorithm to
search beyond the confines of a local minimum. A larger interval, however, requires
that a greater number of iterations be performed to reduce the interval of uncertainty
to the required degree. This final interval of uncertainty must be small so that the
midpoint of the interval is reasonably close to the optimal stepsize value. It is this
midpoint that is the stepsize value $\alpha_k$ that will be used by the conjugate gradient
algorithm to update the weights and thresholds of the neural network. A larger final
interval of uncertainty increases the chances of a less than optimal choice for the final
stepsize. A balance must therefore be struck between the size of the initial interval of
uncertainty, the size of the final interval of uncertainty, and the number of iterations
to be performed.

Initial investigations were performed to determine the range of stepsize
values that were typical for various neural network applications. It was found that
the stepsize ($\alpha_k$) generally did not exceed a value of 10.0 and was typically less
than 1.0. An initial interval of uncertainty of 10.0 was therefore used throughout the
remainder of the thesis research.

In the course of determining the initial interval of uncertainty it was found
that the line search method would occasionally yield a final stepsize value ($\alpha_k$)
which produced an error function value much greater than the previous iteration's value.
It was determined that this problem was a result of the error function surface not being
unimodal in the direction ($\mathbf{d}_k$) along which the algorithm searched for the
minimum. If a second minimum was closer to one of the two evaluation points
($\lambda_k$ and $\mu_k$) than the true minimum, as shown in Figure 3.5, then the algorithm
would converge to this second minimum. This would result in an error function value
larger than when the line search algorithm started. To remedy this problem, the initial
interval of uncertainty was shifted to the left so that the first point evaluated was for
$\lambda_0 = 0$. If the error function for the final stepsize value ($\alpha_k$) was
greater than the error function value with a stepsize of zero, then a stepsize of zero was
returned as the final stepsize value ($\alpha_k$). This had the effect of resetting the
conjugate gradient algorithm. A stepsize of zero caused the algorithm to retain the same
weights and thresholds for the next iteration of the algorithm. As a result, the gradient
($\mathbf{g}_{k+1}$) at the next iteration was identical to the previous gradient
($\mathbf{g}_k$) and the two successive identical gradients would produce a deflection
coefficient ($\beta_k$) equal to zero. This would reset the direction vector
($\mathbf{d}_k$) to the value of the present gradient ($\mathbf{g}_k$) rather than the
weighted sum of previous gradients. This had the effect of reinitializing the conjugate
gradient method, but at a new starting point ($\mathbf{h}_k$) on the error function
surface.
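
The safeguard described above amounts to one comparison after the line search has
finished. A minimal sketch follows; the function name `guarded_step` and the error
profile are hypothetical and chosen only so that the profile has two minima along the
search direction.

```python
from typing import Callable


def guarded_step(error_along_line: Callable[[float], float], alpha: float) -> float:
    """Fall back to a zero step when alpha would increase the error."""
    if error_along_line(alpha) > error_along_line(0.0):
        return 0.0
    return alpha


def profile(alpha: float) -> float:
    """Hypothetical non-unimodal error profile along the search direction."""
    return (alpha - 1.0) ** 2 * (alpha - 6.0) ** 2 + 0.5 * alpha


rejected = guarded_step(profile, 3.5)  # error at 3.5 exceeds error at 0
accepted = guarded_step(profile, 1.0)  # error at 1.0 is below error at 0
```

A rejected step leaves the weights and thresholds unchanged, which is what produces the
two identical successive gradients and the zero deflection coefficient mentioned above.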


Figure 3.5:   Line profile of the error function surface

Having fixed the initial interval of uncertainty, the number of iterations of
the line search algorithm performed during each iteration of the conjugate gradient
method was varied to determine an optimal number. Using sixteen iterations, the
conjugate gradient algorithm was able to consistently reduce the value of the error
function. The value of the error function did not consistently drop when fewer than
sixteen iterations were used. Using equation 3.9 this resulted in a final interval of
uncertainty of 0.00626.
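
The arithmetic behind this figure can be reproduced directly from equations 3.5, 3.6 and
3.9. The helper names below are hypothetical, and the tolerance of 0.01 in the last line
is an illustrative value only; the count of sixteen iterations was chosen experimentally
as described above.

```python
def fibonacci(n: int) -> int:
    """Return F_n with F_0 = F_1 = 1 (equations 3.5 and 3.6)."""
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def final_interval(initial: float, n: int) -> float:
    """Size of the final interval of uncertainty from equation 3.9."""
    return initial / fibonacci(n)


def iterations_needed(initial: float, tolerance: float) -> int:
    """Smallest n whose final interval does not exceed the tolerance."""
    n = 1
    while final_interval(initial, n) > tolerance:
        n += 1
    return n


width = final_interval(10.0, 16)         # 10.0 / F_16
needed = iterations_needed(10.0, 0.01)   # illustrative tolerance only
```

Since $F_{16} = 1597$, an initial interval of 10.0 shrinks to 10.0/1597 after sixteen
iterations.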
C.      COMPUTER PROGRAM IMPLEMENTATION

1.      Conjugate  Gradient  Algorithm 

The  conjugate  gradient  algorithm  was  implemented  for  a  multiple  in]:)ut, 
single  output  neural  network  using  the  C  programming  language.  A  flow  chart  show- 
ing the  basic  functions  that  are  ])erformed  by  the  program  is  shown  in  figure  ."i.Ci.  The 
user  is  prompted  at  the  start  of  the  program  for  tlie  number  of  neurons  in  each  stage 
of  the  neural  network,  the  number  of  iterations  of  the  conjugate  gradient  algorithm 
that  should  be  performed,  and  the  name  of  the  input  file  that  contains  the  training 
data that the algorithm will use to adapt the weights and thresholds of the network.
The number of neurons allowed in the network is limited to a total of ot) and the
number of weights connecting the neurons is limited to 500. This maximum number
of neurons and weights was more than large enough for the various problems to which
the program was applied. The training data file consists of columns of data in which
each column is associated with an input to the neural network except for the last
column. The last column is the desired output of the neural network. Each row is a
separate training data set. Upon completion of the program three files are produced.
The first is a file that contains the final results. The first column of the file is the
desired value and the second column is the value that the neural network produced
using the final weights and thresholds of the network. If the algorithm has performed
as expected and reduced the error function to a small value, then the two columns of
data should be nearly identical. The second output file produced contains the final
weights and thresholds of the network. This file can then be used by any other program
which simulates the operation of a neural network with the same configuration
of neurons. The final file is produced only if the neural network has two inputs. The
file consists of a 21 x 21 matrix of neural network output values that were produced
by applying a sequence of twenty-one evenly spaced values between 0.0 and 1.0 to
each of the two inputs. The resulting file can be used to produce a three dimensional
mesh of the output surface of the neural network. Examples of the input screen,
output screen, and both the input and output files are contained in Appendix A. A
copy of the C program source code is contained in Appendix B.
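
The construction of the 21 x 21 output matrix can be sketched as follows. This is a minimal illustration in Python, not the C program of Appendix B. The function name output_surface and the callable network_output (a stand-in for a trained two-input network) are assumptions made for the example.

```python
def output_surface(network_output, points=21):
    """Evaluate a two-input network on an evenly spaced grid over [0, 1] x [0, 1].

    network_output is a hypothetical callable f(x0, x1) -> float standing in
    for a trained network; it is not part of the thesis program.
    """
    # Twenty-one evenly spaced values between 0.0 and 1.0 (spacing 0.05).
    axis = [i / (points - 1) for i in range(points)]
    # Row r holds the outputs for the r-th value of the first input.
    return [[network_output(x0, x1) for x1 in axis] for x0 in axis]


# Example with a placeholder function in place of a trained network.
surface = output_surface(lambda x0, x1: x0 * x1)
```

Each row of the resulting matrix can be written as one line of the mesh file.
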
2.      Backpropagation  Algorithm 

In order to evaluate the conjugate gradient algorithm's performance, the
backpropagation method was also implemented. The basic flow chart for the
backpropagation method is shown in Figure 3.7. Because of the similarity between the
conjugate gradient method and the backpropagation method, this required only a few
changes to the program that implemented the conjugate gradient algorithm. These
changes consisted of

•  Replacing the stepsize value (cu) calculated by the Fibonacci line search with a
user specified constant referred to as the learning rate by the backpropagation
method.

•  Replacing the deflection coefficient (-4), which is calculated for every iteration
of the algorithm, with a user specified constant referred to as the momentum
factor by the backpropagation method.

•  Updating the weights and thresholds of the neural network after the application
of each training data set rather than upon completion of a complete pass through
the entire training data file.

The input and output files remain the same as those for the conjugate gradient version
of the program.
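
The relationship between the two methods can be illustrated with a single update routine. The sketch below is an assumption-laden illustration in Python, not the thesis code: the names update_step, step_size, and deflection are hypothetical. The conjugate gradient version would call it once per pass with a line-search step size and a recalculated deflection coefficient; the backpropagation version would call it after every training data set with two user-supplied constants.

```python
def update_step(weights, gradient, prev_direction, step_size, deflection):
    """One weight update shared by both methods (illustrative names).

    direction = -gradient + deflection * previous direction
    weights   = weights + step_size * direction
    """
    direction = [-g + deflection * d for g, d in zip(gradient, prev_direction)]
    new_weights = [w + step_size * d for w, d in zip(weights, direction)]
    return new_weights, direction


# Backpropagation-style call: constant learning rate and momentum factor.
# The values 0.5 and 0.9 are arbitrary examples, not taken from the thesis.
w, d = update_step([0.0, 0.0], [1.0, -2.0], [0.0, 0.0],
                   step_size=0.5, deflection=0.9)
```

With a constant step size, the deflection term is equivalent to the usual momentum term, since the previous weight change equals the step size times the previous direction.
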

The following chapter compares the performance of the conjugate gradient
and backpropagation algorithms and also presents the results of several neural network
applications.

Figure 3.6: Conjugate gradient algorithm flowchart

Figure 3.7: Backpropagation algorithm flowchart


IV.  RESULTS 

In  this  chapter,  the  results  of  the  research  conducted  on  neural  networks  using 
the  conjugate  gradient  method  are  presented.  The  chapter  is  divided  into  two  parts. 
The  first  concerns  the  performance  of  the  conjugate  gradient  algorithm  compared 
to  that  of  the  backpropagation  method.  The  second  provides  several  examples  of 
neural  network  applications.  Where  possible,  the  performance  of  the  neural  network 
is  compared  to  its  linear  counterpart. 

A.      CONJUGATE  GRADIENT  ALGORITHM  PERFORMANCE 
1.      Performance  Measures 

The rationale for implementing the conjugate gradient algorithm was to
develop an alternative to the backpropagation method that would converge more
quickly to the optimal set of weights and thresholds for a given problem. The error
function (E) is a measure of whether the weights and thresholds of a neural network
are optimum when applied to a particular problem. The smaller the error function
value, the more nearly optimum the weights and thresholds are. Both algorithms
reduce the value of the error function by iteratively adapting the weights and thresholds
of the neural network. The rate at which the backpropagation and conjugate
gradient algorithms converge to the optimal set of weights and thresholds can be
measured using several methods. The simplest approach would be to determine the
number of iterations each algorithm requires to reduce the value of the error function
to a prescribed level. The number of iterations for each algorithm would then be
compared and the algorithm requiring fewer iterations would be considered to converge
more quickly. This approach does not, however, take into account the greater
computational complexity of the conjugate gradient method. A more accurate measure
of performance for the purposes of comparison is the number of multiplications
performed by each algorithm. This measure better reflects the relative computational
requirements of the two algorithms. The number of multiplications performed by each
of the methods over one iteration is fixed. We can therefore calculate a multiplication
ratio of the two methods and then use this ratio in conjunction with the number of
iterations to compare their relative performance.

2.      Calculation  of  the  Multiplication  Ratio 

The number of multiplications performed by both the backpropagation
method and the conjugate gradient method over one iteration is a function of several
variables. These include the number of neurons and weights in the network, the size
of the training data file used to train the network, and the number of iterations performed
by the Fibonacci line search method. Tables 4.1 and 4.2 show the number of
multiplications required by various functions of the conjugate gradient and backpropagation
method, respectively. The tables also show the total number of times each
function is performed during a single iteration of the algorithm. The variable T is the
number of training data sets used to train the network, the variable P is the number
of weights and thresholds in the network, and R is the number of neurons in the
network. Table 4.1 figures reflect that the step size (q/,. ) is calculated using sixteen
iterations of the Fibonacci line search algorithm. The total number of multiplications
(M) performed by each of the algorithms is therefore

$M_{CG} = T(20P + 37R + 17) + 21(P + R) + 35$    (4.1)

for the conjugate gradient method and

$M_{BP} = T(5P + 5R)$    (4.2)

for the backpropagation method.

TABLE 4.1: MULTIPLICATIONS - CONJUGATE GRADIENT METHOD

Function                            Times Performed    Number of multiplies
Calculate network output            18T                P + 2R
Calculate gradient vector           T                  2P + R
Calculate deflection coefficient    1                  2P + 2R
Update direction vector             1                  P + R
Calculate step distance             1                  35
Update weight vector                18                 P + R
Calculate error sum                 17                 T

TABLE 4.2: MULTIPLICATIONS - BACKPROPAGATION METHOD

Function                            Times Performed    Number of multiplies
Calculate network output            T                  P + 2R
Calculate gradient vector           T                  2P + R
Update direction vector             T                  P + R
Update weight vector                T                  P + R

We can then derive the multiplication ratio by dividing $M_{CG}$ by $M_{BP}$ to obtain

$\mathrm{RATIO} = \frac{M_{CG}}{M_{BP}} = \frac{T(20P + 37R + 17) + 21(P + R) + 35}{T(5P + 5R)}$.    (4.3)

Equation 4.3 can then be factored into four terms as shown below

$\mathrm{RATIO} = 4 + \frac{17(R + 1)}{5(P + R)} + \frac{21}{5T} + \frac{7}{T(P + R)}$.    (4.4)

For the purposes of approximation, the last two terms of equation 4.4 can be eliminated
since the number of training data sets used to train the neural network is
typically large. As the number of neurons in a network is increased, the number of
connections or weights in the network increases at a much greater rate. This happens
because each neuron in a given layer is connected to every neuron in the next layer
of the network. As a result, the second term of equation 4.4 steadily decreases as
the number of neurons is increased. The lower bound on the multiplication ratio is
therefore approximately four and the upper bound can be set at approximately five
for networks having more than just a few neurons.
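
The bounds above can be checked numerically. The following Python sketch assumes the multiplication counts read M_CG = T(20P + 37R + 17) + 21(P + R) + 35 and M_BP = T(5P + 5R), as totaled from Tables 4.1 and 4.2; the function names and the example network size are illustrative assumptions.

```python
def multiplications_cg(T, P, R):
    """Multiplications per conjugate gradient iteration (equation 4.1)."""
    return T * (20 * P + 37 * R + 17) + 21 * (P + R) + 35


def multiplications_bp(T, P, R):
    """Multiplications per backpropagation iteration (equation 4.2)."""
    return T * (5 * P + 5 * R)


def multiplication_ratio(T, P, R):
    """Ratio of the two counts (equation 4.3)."""
    return multiplications_cg(T, P, R) / multiplications_bp(T, P, R)


def factored_ratio(T, P, R):
    """The same ratio written as the four terms of equation 4.4."""
    return 4 + 17 * (R + 1) / (5 * (P + R)) + 21 / (5 * T) + 7 / (T * (P + R))


# Hypothetical example: a 2-8-4-1 network has 13 neurons and
# 2*8 + 8*4 + 4*1 = 52 connections. Counting one threshold per neuron
# (an assumption) gives P = 65. T = 121 training data sets.
ratio = multiplication_ratio(T=121, P=65, R=13)
```

For this example the ratio falls between the approximate bounds of four and five, and the direct and factored forms agree.
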
3.      Performance  Results 

The performance of the conjugate gradient method was compared to the
performance of the backpropagation method using two different training problems.
The first consisted of training the neural network to produce a binary output of either
one or zero depending on the inputs to the network. The second problem involved
training the neural network to produce a specific value within the range of zero to
one for a given set of inputs to the network.

A plot of the normalized value of the error function versus the number of iterations
performed for the binary problem is pictured in Figure 4.1 for the backpropagation
algorithm and in Figure 4.2 for the conjugate gradient algorithm. Note the
difference in the horizontal scale of the two figures. The error function steadily decreased
for the conjugate gradient method while the error function actually increased
for approximately the first 100 iterations of the backpropagation algorithm. Also
note that the error function's rate of change was much more even for the conjugate
gradient algorithm than for the backpropagation method.

In order to compare the relative performance of the two algorithms, the
multiplication ratio's upper bound of five was used. Pictured in Figure 4.3 is a
comparison of the two algorithms' convergence rates with respect to the approximate
number of multiplications performed by each algorithm. As can be seen, for the binary
case, the conjugate gradient method consistently outperformed the backpropagation
method for any given number of multiplications performed.
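
One way to place both convergence curves on a common multiplication axis is sketched below in Python. The function name and the error values are hypothetical; only the ratio of five comes from the text. Multiplications are counted in units of one backpropagation iteration.

```python
def to_multiplication_axis(errors_per_iteration, multiplications_per_iteration):
    """Pair each error value with the cumulative multiplication count."""
    return [((i + 1) * multiplications_per_iteration, e)
            for i, e in enumerate(errors_per_iteration)]


# Made-up error histories for illustration only.
bp_curve = to_multiplication_axis([0.9, 0.8, 0.7, 0.6, 0.5], 1)
# One conjugate gradient iteration is charged as five backpropagation iterations.
cg_curve = to_multiplication_axis([0.4], 5)
```

Plotting both lists against their first element compares the methods at equal computational cost rather than at equal iteration count.
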

The results were even more apparent for the continuous output problem.
The backpropagation method was unable to significantly reduce the error function's
value for the first 500 iterations of the algorithm as is shown in Figure 4.4. The conjugate
gradient method, however, steadily reduced the value of the error function
after each iteration of the algorithm (Figure 4.5). Comparison of the convergence
rates of the two methods with respect to the number of multiplications required in
each case is shown in Figure 4.6. For any given number of multiplications the conjugate
gradient method greatly outperformed the backpropagation method.

The conclusion from the two examples above is that the conjugate gradient
method performs as well as or better than the backpropagation method with respect to
both the number of iterations and the number of multiplications required to reduce
the error function to a desired level. The conjugate gradient method therefore satisfies
one goal of this thesis, which was to develop an alternative to the backpropagation
method that would converge more quickly to the optimal set of weights and thresholds
for any given problem.

Figure 4.1: Binary problem - backpropagation

Figure 4.2: Binary problem - conjugate gradient

Figure 4.4: Continuous problem - backpropagation

Figure 4.5: Continuous problem - conjugate gradient

Figure 4.6: Continuous problem - comparison

B.      NEURAL NETWORK APPLICATION RESULTS

Several simple applications were chosen to evaluate the performance of the conjugate
gradient method vis-a-vis the backpropagation method. These applications
were also used to develop a better understanding of the potential signal processing
applications for the neural network. When possible, the neural network's performance
was compared to its linear counterpart.

1.      A  Classification  Problem 

The goal for this problem was to train a neural network to differentiate
between two classes of inputs. The two classes of inputs consisted of points which
fell either inside or outside of a circle with a diameter of 0.5 centered within a
unit square as shown in Figure 4.7. This classification problem, although relatively
simple, is representative of one of the primary tasks to which neural networks have
been applied: pattern recognition and classification [Ref. 1:pp. 66-67].

The points used to construct the training data file were evenly spaced 0.1
apart from zero to one for both the $X_0$ and $X_1$ coordinates as shown in Figure 4.7.
This produced a total of 121 points over the unit square. The training data file
was composed of 121 data sets, each set consisting of the coordinates for one of the
training points and a value representing the desired class to which the point belonged.
The desired value for a point falling inside the circle was a one. The desired value
for a point falling outside the circle was a zero. The conjugate gradient algorithm
was used to train a neural network which had two inputs, eight first layer neurons,
four second layer neurons, and one output neuron (a 2-8-4-1 configuration). After
100 iterations of the algorithm, the total squared error summed over the entire 121
training data sets was 6.2G x 10"". The resulting output of the neural network as a
function of its inputs is pictured in Figure 4.8.
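
The training data described above can be generated with a few lines of Python. This is a minimal sketch under stated assumptions: the function name is hypothetical, the circle is taken to be centered at (0.5, 0.5), and a point is counted as inside when its distance from the center is strictly less than the radius (no grid point lies exactly on the circle, so the choice does not matter here).

```python
def classification_training_data(steps=10, diameter=0.5):
    """Build (x0, x1, desired) rows on an evenly spaced grid over the unit square.

    desired is 1 for points inside a circle of the given diameter centered at
    (0.5, 0.5) and 0 otherwise.
    """
    radius_sq = (diameter / 2) ** 2
    rows = []
    for i in range(steps + 1):
        for j in range(steps + 1):
            x0, x1 = i / steps, j / steps
            inside = (x0 - 0.5) ** 2 + (x1 - 0.5) ** 2 < radius_sq
            rows.append((x0, x1, 1 if inside else 0))
    return rows


data = classification_training_data()
```

Each row corresponds to one line of the training data file: two input columns followed by the desired output.
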
Figure 4.7: Training data for the classification problem


The neural network produced output values ranging from 1.56 x 10"'^ to
1.0 and was able to properly identify the class to which each of the training data
points belonged. The contour plot of the neural network output for a single contour
value of 0.5 is shown in Figure 4.9. The plot clearly shows that the conjugate gradient
algorithm was able to calculate a set of weights and thresholds for the neural network
that very closely approximates the desired result. A circular decision region was
formed that allowed the neural network to differentiate between points falling inside
the circle and points falling outside the circle. This is because a neural network, due
to its nonlinearity, has the ability to form arbitrarily complex decision regions.

This simple example clearly demonstrates the ability of a neural network
to produce a nonlinear mapping of a set of analog inputs to a single binary output
value. In this case, this nonlinear mapping was used to produce the two decision
regions pictured in Figure 4.9. For other applications, the formation of decision
regions may not be called for. Rather, the output of the network may have to be
continuously variable.

Figure 4.8: Neural network output versus input

Figure 4.9: Neural network output contour plot

2.      Nonlinear Time Series Prediction

The previous problem required the neural network to produce only a binary
output of one or zero. The second application was selected so that the conjugate
gradient algorithm's performance could be evaluated for the case of a continuously
variable range of desired output values. This type of application falls into a second
class of tasks for which the neural network can be applied: nonlinear mapping of a
set of analog inputs to an analog output value [Ref. 1:p. 67]. It was decided to apply
the neural network to the problem of one-step prediction of a nonlinear time series.
One-step prediction is a fairly common application in digital signal processing. A
nonlinear time series was used since one-step prediction for a linear time series could
easily be satisfied using a linear filter rather than a neural network. The method
used to perform the prediction is similar to that used by a linear predictor. The
next value in the series is predicted using the previous values of the series. The basic
configuration is pictured in Figure 4.10.

Figure 4.10: Time series predictor


For a linear predictor the output of the predictor is merely a weighted sum
of a given number of previous values of the series. The neural network, however, can
produce an output which is a nonlinear function of a given number of previous values.
The nonlinear time series used to train and evaluate the conjugate gradient algorithm
was produced using

$x_{i+1} = 4B x_i (1 - x_i)$.    (4.5)

This equation is referred to as the classic logistic or Feigenbaum map and has been
studied quite extensively because of its simplicity and its application to chaos theory.
This iterated equation (equation 4.5) produces an ergodic, chaotic time series that is
bounded and quasi-periodic [Ref. 12:p. 10]. A training sequence of 100 samples was
generated using equation 4.5 with the variable B equal to 1.0. This sequence was
then used to adaptively calculate the optimal coefficients for a linear second order
prediction filter using a recursive least squares method. The linear predictor's results
are pictured in Figure 4.11. Only the first fifty samples of the sequence were plotted
so that the two curves on the graph could be better differentiated. It is obvious
from Figure 4.11 that the linear predictor was unable to accurately predict the next
value in the nonlinear series using the two previous values of the series. When the
difference between the actual and predicted signals is plotted one can see that the
magnitude of the error is almost as great as the magnitude of the original signal (see
Figure 4.12). As was expected, the linear predictor performs poorly for a nonlinear
problem.
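
The linear predictor experiment can be approximated with the Python sketch below. Several details are assumptions rather than the thesis procedure: the map is taken as x[i+1] = 4*B*x[i]*(1 - x[i]) with B = 1.0, the starting value 0.3 is arbitrary because the initial condition is not stated, and the coefficients are found by a batch least-squares solve instead of the recursive least squares method used in the thesis.

```python
def logistic_series(n, b=1.0, x0=0.3):
    """Generate n samples of x[i+1] = 4*b*x[i]*(1 - x[i]).

    x0 = 0.3 is an arbitrary starting value chosen for illustration.
    """
    series = [x0]
    while len(series) < n:
        x = series[-1]
        series.append(4.0 * b * x * (1.0 - x))
    return series


def fit_second_order_predictor(series):
    """Batch least-squares fit of x[i] ~ a1*x[i-1] + a2*x[i-2].

    Solves the 2x2 normal equations directly (not the recursive method).
    """
    s11 = s12 = s22 = r1 = r2 = 0.0
    for i in range(2, len(series)):
        u1, u2, d = series[i - 1], series[i - 2], series[i]
        s11 += u1 * u1
        s12 += u1 * u2
        s22 += u2 * u2
        r1 += u1 * d
        r2 += u2 * d
    det = s11 * s22 - s12 * s12
    a1 = (r1 * s22 - r2 * s12) / det
    a2 = (r2 * s11 - r1 * s12) / det
    return a1, a2


def rms_prediction_error(series, a1, a2):
    """Root-mean-square one-step prediction error over the series."""
    errors = [series[i] - (a1 * series[i - 1] + a2 * series[i - 2])
              for i in range(2, len(series))]
    return (sum(e * e for e in errors) / len(errors)) ** 0.5


series = logistic_series(100)
a1, a2 = fit_second_order_predictor(series)
rms = rms_prediction_error(series, a1, a2)
```

The residual error stays a large fraction of the signal amplitude, which is consistent with the observation that a weighted sum of past values cannot track this series.
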

The same training sequence was then used by the conjugate gradient algorithm
to train a neural network with a 2-4-2-1 configuration. The network was
trained to predict the next value of the series based on the two previous values. After
100 iterations, the sum of the squared errors over the 100 training data sets was
7.25 x 10~^. This would equate to an average standard deviation from the actual signal
of approximately 8.51 x 10"'. The neural network's results are pictured in Figure
4.13. It is apparent that the neural network performed much better than the linear
predictor. The prediction error for the neural network is pictured in Figure 4.14. The
magnitude of the neural network's prediction error is much smaller than that for the
linear predictor. This error could be reduced even further if additional iterations
of the conjugate gradient algorithm were performed.

This example demonstrates that a neural network is quite capable of performing
nonlinear mapping of a set of analog inputs to an analog output. The neural
network can also produce more accurate results than the linear approach when the
problem to be solved is nonlinear. It must be recognized, however, that although
the neural network produces more accurate results, it is much more computationally
complex than the linear approach to the problem.

Figure 4.11: Linear predictor output and actual signal

Figure 4.13: Neural network predicted and actual signal

3.      Channel Equalization

One final example will serve to demonstrate the potential applications for
the neural network. The idea of using a neural network to perform channel equalization
for a nonminimum phase transmission channel was borrowed from Gibson, Siu,
and Cowan [Ref. 15]. The experimental results indicate that a neural network could
potentially provide superior performance to its linear counterpart when the channel
over which the digital data is transmitted is nonminimum phase.

a.      Transmission Channel Model and Equalizer Model

When digital data is transmitted, it frequently becomes distorted by
the channel over which it travels. This distortion can frequently be modeled using
a linear time invariant (LTI) system [Ref. 8:p. 426]. The channel model, shown in
Figure 4.15, consists of the transfer function $H(z)$ and a channel noise term $n_i$. The
transfer function of the channel is defined by a finite impulse response (FIR) equation

$H(z) = a_0 + a_1 z^{-1} + \cdots + a_k z^{-k}$.    (4.6)

Figure 4.15: Channel model and equalizer

The channel noise term $n_i$ is typically assumed to be zero mean, additive white
Gaussian noise. The purpose of a channel equalizer, also shown in Figure 4.15, is to
reverse the distorting effects of the channel and to recover the original signal ($x_i$) using
$m$ samples of the received signal, $y_i, y_{i-1}, \ldots, y_{i-m+1}$. If we assume, for a moment,
that the noise term ($n_i$) is zero, then the received signal $y_i$ is merely a weighted sum
of the present and past values of the original signal $x_i$. This can be expressed as

$y_i = \sum_{j=0}^{k} a_j x_{i-j}$    (4.7)

where $a_j$ are the $k + 1$ coefficients associated with the channel transfer function $H(z)$.
For a binary signal (±1), therefore, the received signal $y_i$ can assume only one of $2^{k+1}$
possible values. If we then try to estimate the original signal $x_i$ using an $m$ sample
vector $[y_i, y_{i-1}, \ldots, y_{i-m+1}]$, we can only form a fixed number of permutations of the
received signal vector. Each received signal vector $[y_i, y_{i-1}, \ldots, y_{i-m+1}]$ belongs to
either the set of vectors corresponding to a transmitted binary one (+1) or the set of
vectors corresponding to a transmitted binary zero (-1). The channel equalizer produces
an estimate of the transmitted signal $x_i$ by determining which set the received
signal vector belongs to. It has been shown that a linear transversal equalizer can
perform such an operation if the channel transfer function $H(z)$ is minimum phase
[Ref. 15:p. 1184]. If the channel transfer function is not minimum phase, then the two
received vector sets are not linearly separable and a linear equalizer cannot accurately
estimate $x_i$ based on the received data vector set $[y_i, y_{i-1}, \ldots, y_{i-m+1}]$. If a delay, $d$,
is introduced in the calculation of $x_i$, such that at the $i$th iteration the equalizer
estimates the original signal $x_{i-d}$, then accurate estimation of the original signal can
be achieved [Ref. 15:p. 1184]. This value for $d$, however, may not be known, or may
vary with time. The result is that a linear transversal equalizer, even with a delay,
may not be able to satisfactorily equalize a nonminimum phase channel.
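
The finite set of noise-free received values can be enumerated directly from the weighted-sum relation for the received signal. The Python sketch below is illustrative only: the function name is hypothetical, and the coefficients a_0 = 0.5 and a_1 = 1.0 are an assumed first order (k = 1) example.

```python
from itertools import product


def received_values(coefficients):
    """Noise-free values of y_i = sum_j a_j * x_{i-j} for binary (+1/-1) input."""
    values = set()
    for symbols in product((+1, -1), repeat=len(coefficients)):
        values.add(sum(a * x for a, x in zip(coefficients, symbols)))
    return sorted(values)


# First order example (k = 1) with assumed coefficients a_0 = 0.5, a_1 = 1.0.
levels = received_values([0.5, 1.0])
```

For k + 1 coefficients there are at most 2^(k+1) distinct levels; some may coincide for particular coefficient choices.
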
b.      A Nonminimum Phase Channel Equalizer

The ability of a neural network to form arbitrary decision regions,
demonstrated in Chapter II, could possibly remedy this problem. To investigate this
concept, the first order nonminimum phase transfer function ($H(z) = 0.5 + z^{-1}$) was
used to evaluate the performance of both a neural network and a linear transversal
equalizer. The possible values for $y_i$ using this transfer function are: +1.5, +0.5, -0.5,
and -1.5. A two input neural network and two input linear transversal equalizer were
used since the channel's transfer function was only first order, and this allowed a
graphical analysis of the problem. The eight possible combinations of $y_i$ and $y_{i-1}$ are
shown in Figure 4.16. The symbol x indicates that the original signal $x_i$ had a value
of -1 and the symbol o indicates that $x_i$ was equal to +1. Notice that the symbols
are intermixed such that no single line can be drawn that will completely separate the
two classes of symbols. This is what makes the nonminimum phase case intractable
for the linear transversal equalizer. If the noise term, $n_i$, is now incorporated into
the problem, the result is as shown in Figure 4.17 for a signal-to-noise ratio (SNR)
of 10 dB. The number of possible values for $y_i$ becomes infinite, but the points are
distributed about the original eight points shown in Figure 4.16. The coefficients
for a first order linear transversal equalizer were calculated by applying a recursive
least squares (RLS) algorithm to the set of 500 consecutive values of $y_i$ pictured in
Figure 4.17. The values for $y_i$ were generated by using a random sequence of +1 and
-1 for $x_i$, applying this binary sequence to the transfer function given above, and
adding a normally distributed noise term with a standard deviation equivalent to a
signal-to-noise ratio of 10 dB. The linear transversal equalizer's two decision regions
are pictured in Figure 4.18. The region that is shaded with dots is the area for which
the linear transversal equalizer produced an estimate of +1 for $x_i$ and the unshaded
region is where the equalizer produced an estimate of -1 for $x_i$. Note that the best that
the linear equalizer could do was to define two decision regions such that three of the
four possible points fell within the proper region. The same 500 value data set was
then used to train a neural network having a 2-6-4-1 configuration. The decision
regions formed by the neural network after 100 iterations of the conjugate gradient
algorithm are pictured in Figure 4.19. The neural network, because of its ability
to account for the nonlinearities, was able to form two separate decision regions for
each of the two possible values for $x_i$. The four decision regions properly encompass
the eight possible points associated with $y_i$ and $y_{i-1}$. As a result, the total number
of errors produced over the 500 value training set dropped from 151 for the linear
equalizer to 65 for the neural network. The neural network's ability to form more
complex decision regions allowed it to more accurately perform equalization when the
transfer function was nonminimum phase.
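
The intermixing of the two classes can be checked by brute force for the noise-free case. The Python sketch below assumes the channel y_i = 0.5*x_i + x_{i-1}; the function names are hypothetical, and the search over line orientations is a coarse sweep for illustration, not the procedure used in the thesis.

```python
import math
from itertools import product


def channel_points(delay):
    """Noise-free (y_i, y_{i-1}) pairs for H(z) = 0.5 + z^-1 and binary input.

    Each point is labeled with x_{i-delay}, the symbol the equalizer must
    estimate (delay is 0 or 1).
    """
    points = []
    for x_i, x_im1, x_im2 in product((+1, -1), repeat=3):
        y_i = 0.5 * x_i + x_im1
        y_im1 = 0.5 * x_im1 + x_im2
        label = (x_i, x_im1)[delay]
        points.append((y_i, y_im1, label))
    return points


def best_linear_score(points, angles=720):
    """Most points classified correctly by one straight line, over a sweep.

    The line's normal direction is swept through a full circle and a
    threshold is tried between every pair of neighboring projections.
    """
    best = 0
    for step in range(angles):
        theta = 2.0 * math.pi * step / angles
        wx, wy = math.cos(theta), math.sin(theta)
        proj = sorted((wx * y0 + wy * y1, label) for y0, y1, label in points)
        values = [p for p, _ in proj]
        thresholds = [values[0] - 1.0] + [
            (a + b) / 2.0 for a, b in zip(values, values[1:])
        ]
        for t in thresholds:
            correct = sum((p > t) == (label > 0) for p, label in proj)
            best = max(best, correct)
    return best


no_delay = best_linear_score(channel_points(delay=0))
one_delay = best_linear_score(channel_points(delay=1))
```

Without a delay no line classifies all eight points correctly, while with a delay of one sample the sign of y_i alone separates them.
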

c.      A Nonminimum Phase Channel Equalizer Using a Delay

It was stated earlier that introduction of a delay $d$ could allow the
linear equalizer to more accurately perform its equalization function. Pictured in
Figure 4.20 are the eight possible points associated with $y_i$ and $y_{i-1}$ for a delay of
one sample (i.e., the estimate of $x_{i-1}$ based on the samples $y_i$ and $y_{i-1}$). The two
classes of points are no longer intermixed as they were for the case of no delay. A set
of coefficients for the linear equalizer can therefore be calculated that will properly
separate the two sets of points. With noise added, however, the sets of points begin to
intermix as shown in Figure 4.21 for a signal-to-noise ratio of 10 dB. The separation
of the two classes becomes more difficult, particularly for the linear equalizer, which
can only use a single line to define the decision boundary. The coefficients for the
linear equalizer were again calculated using the RLS algorithm and the 500 values
for $y_i$ pictured in Figure 4.21. The resulting decision regions are shown in Figure
4.22. Comparison of the two decision regions with the original training data (Figure
4.21) indicates that the linear equalizer was unable to define a single line that could
separate all the points into their proper regions. The linear equalizer produced a
total of 19 errors over the 500 values of the training data set. The same training data
set was then used to train a neural network with a 2-6-4-1 configuration using the
conjugate gradient algorithm. After twenty iterations, the neural network produced
the two decision regions pictured in Figure 4.23. The boundary between the two
decision regions is no longer a straight line but is shaped to take into account the
distribution of points caused by the introduction of noise. The neural network only
produced a total of 3 errors over the 500 value training set.
d.      A  Performance  Comparison 

The results from the two examples above indicate that a neural network can
produce superior results to the linear equalizer both when a delay is introduced
and when it is not. To confirm this result, the performance of both the linear
transversal equalizer and the neural network was evaluated for various
signal-to-noise ratios. The measure of performance for the test was the average
bit error probability. The four signal-to-noise ratios: -5.0 dB, 10 dB, 20 dB, and
25 dB were used to generate four different sets of training sequences. Each sequence
was generated using a different signal-to-noise ratio. Both the linear equalizer
and the neural network were then trained using these four 500-value sequences for
y_i. After calculating the coefficients for the linear equalizer and the weights
and thresholds for the neural network, the bit error performance of each type of
equalizer was calculated by passing the same 100,000-bit sequence through each
equalizer and counting the number of times the equalizer produced an error. The
results for the case where no delay was used are shown in Figure 4.24. As expected,
the bit error probability for the linear equalizer with no delay was extremely
poor. The bit error probability for the neural network dropped steadily as the
magnitude of the noise fell. The lowest of the three curves shown in Figure 4.24
reflects the performance of the neural network at the various signal-to-noise
ratios after having been trained using the 10 dB SNR training data set. Its
performance is equal to or better than that of the neural networks trained and
evaluated for a specific SNR. This is because the lower SNR forced the conjugate
gradient algorithm to produce a set of decision boundaries that were more nearly
optimal. This result was even more apparent for the case when a delay was
introduced in the equalization problem (Figure 4.25). The same method was used as
described above, except that both the linear equalizer and the neural network
produced an estimate of x_{i-1}, rather than x_i, based on the received signals y_i
and y_{i-1}. Once again the neural network performed better than the linear
equalizer, and the neural network trained using 10 dB data performed the best.

Figure 4.21: Possible combinations of y_i and y_{i-1} with noise added (with delay)

Figure 4.22: Linear equalizer (with delay) decision regions
(legend: x marks x_{i-1} = -1, o marks x_{i-1} = +1)

Figure 4.23: Neural network (with delay) decision region
(legend: x marks x_{i-1} = -1, o marks x_{i-1} = +1; horizontal axis y_i)
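
The bit error measurement used in the performance comparison above can be
sketched as follows. This is only an illustration, not the code used for the
thesis results: the function names, the bipolar (+1/-1) symbol convention, and
the zero decision threshold are all assumptions introduced here.

```c
#include <stddef.h>

/* Hypothetical helper: map a soft equalizer output to a hard bit
   decision (+1 or -1). The zero threshold is an assumption that
   suits bipolar symbols. */
static int hard_decision(double soft_output)
{
    return (soft_output >= 0.0) ? 1 : -1;
}

/* Hypothetical helper: fraction of positions where the decided bit
   differs from the transmitted bit. Returns 0.0 for an empty sequence. */
double bit_error_rate(const int *sent, const double *equalized, size_t n)
{
    size_t i;
    size_t errors = 0;

    if (n == 0) {
        return 0.0;
    }
    for (i = 0; i < n; i++) {
        if (hard_decision(equalized[i]) != sent[i]) {
            errors++;
        }
    }
    return (double)errors / (double)n;
}
```

Passing the same test sequence through each equalizer and calling a routine
like this once per equalizer gives directly comparable error estimates.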

One final comparison can be made between the neural network and the linear
transversal equalizer: the neural network without delay versus the linear
equalizer with delay. This comparison is shown in Figure 4.26. Also shown is the
neural network's performance with a delay. The neural network without delay did
not perform as well as the linear equalizer for low signal-to-noise ratios. As the
magnitude of the noise was reduced, however, the performance of the two approaches
became comparable. The neural network with delay, however, was better than any of
the other approaches.

e.      Channel  Equalizer  Conclusions 

The performance of both a linear transversal equalizer and a neural network was
evaluated with respect to their ability to accurately equalize a nonminimum phase
digital data channel. It was found that a linear transversal equalizer was unable
to accurately estimate the original signal because of the channel's nonminimum
phase characteristic. The neural network, because of its ability to form arbitrary
boundaries, did not suffer from this problem. Introduction of a delay allowed both
the linear transversal equalizer and the neural network to improve their
performance. Finally, a neural network using no delay showed a comparable
performance to a linear transversal filter with a delay for high signal-to-noise
ratios. The ability of the neural network to perform equalization without
introduction of a delay could prove useful, particularly if the required delay is
unknown or varies with time.
Figure 4.26: Equalizer performance - all methods


V.  CONCLUSIONS  AND 
RECOMMENDATIONS 

A.      CONCLUSIONS 

The first objective of this thesis research was to develop an alternative to the
backpropagation method for calculating the optimal set of weights and thresholds for a
neural network. The results presented in Chapter IV demonstrated that the conjugate
gradient algorithm developed for this thesis was more computationally efficient than
the well known backpropagation method.

The second objective of this research was to develop a better understanding of
the relationship between the structure of a neural network and its ability to perform
input-to-output mapping. A graphical approach was used to analyze the internal
representations of the neural network. The results of this analysis were presented in
Chapter II.

The final objective of this thesis research was to evaluate the performance of a
neural network for several different signal processing applications. The first example
presented demonstrated the ability of a neural network to perform classification. The
second example, nonlinear time series prediction, compared the performance of a
neural network to its linear equivalent, and showed that the neural network produced
superior results. The final example illustrated the performance differences between a
neural network and a linear approach to nonminimum phase channel equalization.

These applications demonstrated that the nonlinear properties of a neural network
frequently allow the neural network to perform functions more effectively than
its linear counterpart. This is particularly true when the problem itself is
nonlinear. It must be recognized, however, that there is a cost to this increased
functionality. Calculation of the proper weights and thresholds for a given problem
is much more computationally complex. The computational complexity associated with
the use of a neural network must therefore be balanced with the accuracy desired
when deciding whether to use a neural network rather than a linear approach to
solve a given problem.

B.     FUTURE  RESEARCH 

In  the  course  of  this  thesis  research,  several  other  areas  were  identified  that 
merit  additional  study. 

1.  Transfer  Function  Selection 

The sigmoid function used for this research produced an output that ranged
between 0 and 1. Other transfer functions could be investigated that produce a bipolar
output. This could prove to be more useful for typical signal processing applications.
One such transfer function that could be evaluated is the hyperbolic tangent function

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = \frac{1 - e^{-2z}}{1 + e^{-2z}}. \qquad (5.1)$$

This nonlinear function produces a value which ranges between ±1 and is continuously
differentiable for all values of z.

2.  Neural  Network  Dynamic  Range 

The performance of a neural network having a greater dynamic range could
be investigated. The dynamic range of the neural network could be expanded by
allowing adaptation of the output weight w_3. It could also be accomplished by using
a linear transfer function for the single neuron in the output layer of the network. The
output of the network would then be a linear combination of the weighted outputs
from the second layer of the network. This is the approach taken by Lapedes and Farber in
their research [Ref. 11].
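
As an illustration only, an output neuron with a linear transfer function
might look like the sketch below. The function name and argument layout are
assumptions made for this example. The sign convention (weighted sum minus
threshold) is chosen to match the argument of the sigmoid in the Appendix B
listing.

```c
#include <stddef.h>

/* Hypothetical linear output neuron: the network output is the
   weighted sum of the second-layer activities minus a threshold,
   so it is not confined to the (0, 1) range of a sigmoid. */
double linear_output(const double *hidden, const double *weights,
                     size_t n, double theta)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        sum += weights[i] * hidden[i];
    }
    return sum - theta;
}
```

With this form the output range is set by the weights themselves rather than
by the squashing function.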
3.  Internal  Representations

This thesis made no attempt to analyze the internal representations used
by the neural network to produce the desired outputs for a given set of inputs.
Research could be conducted to try to determine exactly what type of functions the
individual neurons in the network perform. This could provide further insight into
the relationship between the structure of a neural network and its ability to perform
a particular task.

4.  Analysis  of  the  Weights  and  Thresholds 

Research could be performed to determine if there is any analytical
significance to the final weight and threshold values for a neural network.


APPENDIX  A:  PROGRAM  OUTPUT  SCREEN 
AND  DATA  FILES 

A.  EXAMPLE  OUTPUT  SCREEN 

**  Conjugate  Gradient  Algorithm  **

What  is  the  name  of  the  training  data  file?  circ.dat 

How  many  inputs  to  the  neural  network?  2 

How  many  1st  layer  neurons?  4 

How  many  2nd  layer  neurons?  2 

There  will  be  only  one  3rd  layer  neuron. 

How  many  passes  thru  the  training  data  set?  2

Initial  Error  sum:  40.2786 

Performing  iteration  number  1 

Beta  value:  0 

Alpha  value:  0.10958 

Error  sum:  17.3579 

Performing  iteration  number  2 
Beta  value:  0.00366825 
Alpha  value:  3.79148 
Error  sum:  17.3564 

Final  error  sum:  17.3557 

Where  do  you  want  the  results  stored?  circ.res 
**  Calculating  final  results  ** 

Where  do  you  want  the  final  weight/theta  values  stored?  circ.wgt 
**  Storing  final  weight/theta  values  **

Where  do  you  want  the  map  matrix  stored?  circ.map 
**  Calculating  map  of  network  ** 


B.     EXAMPLE  INPUT  DATA  FILE 


4.0000e-001  4.0000e-001  1.0000e+000
4.0000e-001  5.0000e-001  1.0000e+000
4.0000e-001  6.0000e-001  1.0000e+000
4.0000e-001  7.0000e-001  1.0000e+000
4.0000e-001  8.0000e-001  0.0000e+000
4.0000e-001  9.0000e-001  0.0000e+000
4.0000e-001  1.0000e+000  0.0000e+000
5.0000e-001  0.0000e+000  0.0000e+000
5.0000e-001  1.0000e-001  0.0000e+000
5.0000e-001  2.0000e-001  0.0000e+000
5.0000e-001  3.0000e-001  1.0000e+000
5.0000e-001  4.0000e-001  1.0000e+000
5.0000e-001  5.0000e-001  1.0000e+000
5.0000e-001  6.0000e-001  1.0000e+000
5.0000e-001  7.0000e-001  1.0000e+000
5.0000e-001  8.0000e-001  0.0000e+000
5.0000e-001  9.0000e-001  0.0000e+000
5.0000e-001  1.0000e+000  0.0000e+000
6.0000e-001  0.0000e+000  0.0000e+000
6.0000e-001  1.0000e-001  0.0000e+000
6.0000e-001  2.0000e-001  0.0000e+000
6.0000e-001  3.0000e-001  1.0000e+000
6.0000e-001  4.0000e-001  1.0000e+000
6.0000e-001  5.0000e-001  1.0000e+000


C.     EXAMPLE  RESULTS  OUTPUT  DATA  FILE


1.000000e+000  1.733739e-001
1.000000e+000  1.738492e-001
1.000000e+000  1.743229e-001
1.000000e+000  1.747937e-001
0.000000e+000  1.752599e-001
0.000000e+000  1.757203e-001
0.000000e+000  1.761736e-001
0.000000e+000  1.714368e-001
0.000000e+000  1.718988e-001
0.000000e+000  1.723659e-001
1.000000e+000  1.728365e-001
1.000000e+000  1.733092e-001
1.000000e+000  1.737825e-001
1.000000e+000  1.742548e-001
1.000000e+000  1.747247e-001
0.000000e+000  1.751906e-001
0.000000e+000  1.756512e-001
0.000000e+000  1.761052e-001
0.000000e+000  1.713874e-001
0.000000e+000  1.718452e-001
0.000000e+000  1.723086e-001
1.000000e+000  1.727760e-001
1.000000e+000  1.732460e-001
1.000000e+000  1.737171e-001


D.     EXAMPLE  FINAL  WEIGHTS  OUTPUT  DATA  FILE 


   2   4   2                                      Number of neurons in each layer
 1.023357   0.621861  -0.194039   0.288292        Input weights [w_0]
-0.092301  -0.595949   0.007433  -0.795105
-0.061310  -0.031015   0.059272   0.010853        Input thresholds [θ_0]
-0.796225  -0.522201                              1st layer weights [w_1]
-0.618687  -0.103397
 0.513663   0.325973
-0.831033   0.906917
-0.279913   0.139071                              1st layer thresholds [θ_1]
 0.262880                                         2nd layer weights [w_2]
-0.542745
 1.344298                                         2nd layer threshold [θ_2]
 1.000000                                         Output weight [w_3]



APPENDIX  B:  PROGRAM  SOURCE  CODE 
LISTING 


#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>

/********************************************************************/
/* This program calculates the weights and thresholds for a         */
/* feedforward multilayer neural network using the conjugate        */
/* gradient optimization method.                                    */
/********************************************************************/
/********************************************************************/
/* FUNCTION DECLARATIONS                                            */
/********************************************************************/

int get_info(char filename[], int num_node[]);
int get_data(char filename[], double ts_data[], int num_inputs);
int init_weights(double *weight_ptr, int num_node[]);
int init_thetas(double *theta_ptr, int num_node[]);
void adapt_network(double weight[], double theta[], int num_node[],
                   int num_weights, int num_theta, double data_array[],
                   int array_size, int max_iteration);
double fire_neurons(double *activity_ptr, double *weight_ptr,
                    double *theta_ptr, int num_node[]);
void calc_gradient(double activity[], double weight[],
                   double theta_gradient[], double gradient[],
                   int num_node[], int num_weights, int num_theta);
double calc_beta(double old_gradient[], double old_theta_gradient[],
                 double new_gradient[], double new_theta_gradient[],
                 int num_inputs, int num_theta);
void update_direction(double gradient[], double direction[], double beta,
                      int num_inputs);
void update_weights(double weight[], double alpha, double direction[],
                    int num_inputs);
double calc_alpha(double weight[], double direction[], double theta[],
                  double theta_direction[], double activity[],
                  double data_array[], int array_size, int num_node[],
                  int num_weights, int num_theta);
void load_values(double *input_ptr, double *output_ptr, int total_num);
int fibon(int n);
void write_result(double weight[], double theta[], int num_node[],
                  double ts_data[], int set_size);
void map_network(double weight[], double theta[], int num_node[]);
void store_weights(double weight[], double *theta_ptr, int num_node[]);

/********************************************************************/
/* MAIN PROGRAM                                                     */
/********************************************************************/
int main(void)
{
    char filename[14];
    int max_iteration, num_node[5], num_weights, set_size, num_theta;
    double ts_data[3000], weight[400], theta[50];

    printf("\n ** Conjugate Gradient Algorithm ** \n");
    max_iteration = get_info(filename, num_node);
    set_size = get_data(filename, ts_data, num_node[0]);
    if (set_size == 0) {
        exit(0);
    }
    num_weights = init_weights(weight, num_node);
    num_theta = init_thetas(theta, num_node);
    adapt_network(weight, theta, num_node, num_weights, num_theta,
                  ts_data, set_size, max_iteration);
    write_result(weight, theta, num_node, ts_data, set_size);
    store_weights(weight, theta, num_node);
    /* The network map is only produced for two-input networks. */
    if (num_node[0] == 2) {
        map_network(weight, theta, num_node);
    }
    exit(0);
}

/********************************************************************/
/* FUNCTION GET_INFO                                                */
/********************************************************************/
int get_info(char filename[], int num_node[])
{
    int max_iteration;

    printf("\n What is the name of the training data file?  ");
    fflush(stdout);
    /* Read at most 13 characters so the caller's 14-byte buffer
       cannot overflow. */
    scanf("%13s", filename);
    printf("\n How many inputs to the neural network?  ");
    scanf("%2d", &num_node[0]);
    printf("\n How many 1st layer neurons?  ");
    scanf("%2d", &num_node[1]);
    printf("\n How many 2nd layer neurons?  ");
    scanf("%2d", &num_node[2]);
    printf("\n There will be only one 3rd layer neuron.  ");
    num_node[3] = 1;
    num_node[4] = 1;
    printf("\n\n How many passes thru the training data set?  ");
    scanf("%4d", &max_iteration);
    return (max_iteration);
}

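/********************************************************************/
/* FUNCTION GET_DATA                                                */
/********************************************************************/
int get_data(char filename[], double ts_data[], int num_inputs)
{
    FILE *stream;
    int i, num_read;

    num_read = 0;
    if ((stream = fopen(&filename[0], "r")) != NULL) {
        /* Read up to 3000 values; the loop body is intentionally empty. */
        for (i = 0; (i < 3000) &&
                    (fscanf(stream, "%lg", ts_data + i) > 0); i++)
            ;
        fclose(stream);
        /* Each training set holds num_inputs values plus one desired output. */
        if ((i % (num_inputs + 1)) != 0) {
            printf("\n\n ** Improper number of input data elements **");
            num_read = 0;
        }
        else {
            num_read = i / (num_inputs + 1);
        }
    }
    else {
        printf("\n\n ** Could not find the specified file **");
    }
    return (num_read);
}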

/********************************************************************/
/* FUNCTION INIT_WEIGHTS                                            */
/********************************************************************/
int init_weights(double *weight_ptr, int num_node[])
{
#define MAX_VAL 16384.0   /* half the rand() range when RAND_MAX is 32767 */
    int num_weights, i;

    srand(1);
    num_weights = 0;
    for (i = 0; i < 3; i++) {
        num_weights += num_node[i] * num_node[i+1];
    }
    /* Each layer's weights start at a pseudo-random value near (-1, 1]. */
    for (i = 0; i < (num_node[0]*num_node[1]); i++) {
        *weight_ptr++ = (1.0 - (rand()/MAX_VAL));
    }
    for (i = 0; i < (num_node[1]*num_node[2]); i++) {
        *weight_ptr++ = (1.0 - (rand()/MAX_VAL));
    }
    for (i = 0; i < (num_node[2]*num_node[3]); i++) {
        *weight_ptr++ = (1.0 - (rand()/MAX_VAL));
    }
    /* The final entry is the output weight. */
    *weight_ptr = 1.0;
    num_weights += 1;
    return (num_weights);
}

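/********************************************************************/
/* FUNCTION INIT_THETAS                                             */
/********************************************************************/
int init_thetas(double *theta_ptr, int num_node[])
{
    int num_theta, i;

    num_theta = num_node[1] + num_node[2] + num_node[3];
    for (i = 0; i < num_theta; i++) {
        *theta_ptr++ = 0.0;
    }
    return (num_theta);
}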

/********************************************************************/
/* FUNCTION ADAPT_NETWORK                                           */
/********************************************************************/
void adapt_network(double weight[], double theta[], int num_node[],
                   int num_weights, int num_theta, double data_array[],
                   int array_size, int max_iteration)
{
    int iteration, i, j, set_num;
    double activity[50], gradient[400], gradient_sum[400];
    /* Search directions start at zero so the first update (beta = 0)
       never multiplies an uninitialized value. */
    double direction[400] = {0.0}, theta_direction[50] = {0.0};
    double actual_output, desired_output, alpha, beta, old_gradient_mag;
    double theta_gradient[50], theta_sum[50];
    /* old_gradient_sum is sized like gradient_sum so that networks with
       more than 50 weights do not overflow it. */
    double old_gradient_sum[400], old_theta_sum[50], error, errorsum;
    double *array_ptr;

    for (iteration = 0; iteration < max_iteration; iteration++) {
        for (i = 0; i < num_weights; i++) {
            gradient_sum[i] = 0.0;
        }
        for (i = 0; i < num_theta; i++) {
            theta_sum[i] = 0.0;
        }
        errorsum = 0.0;
        array_ptr = data_array;
        /* Accumulate the gradient over the whole training set. */
        for (set_num = 0; set_num < array_size; set_num++) {
            for (i = 0; i < num_node[0]; i++) {
                activity[i] = *array_ptr++;
            }
            desired_output = *array_ptr++;
            actual_output = fire_neurons(activity, weight, theta, num_node);
            error = actual_output - desired_output;
            error *= error;
            gradient[num_weights-1] = (actual_output - desired_output) *
                                      actual_output;
            calc_gradient(activity, weight, theta_gradient, gradient, num_node,
                          num_weights, num_theta);
            for (i = 0; i < (num_weights-1); i++) {
                gradient_sum[i] += gradient[i];
            }
            for (i = 0; i < num_theta; i++) {
                theta_sum[i] += theta_gradient[i];
            }
            errorsum += error;
        }
        printf(" Error sum: %lg \n", errorsum);
        if (iteration == 0) {
            beta = 0.0;
        }
        else {
            beta = calc_beta(old_gradient_sum, old_theta_sum, gradient_sum,
                             theta_sum, (num_weights-1), num_theta);
        }
        for (j = 0; j < (num_weights-1); j++) {
            old_gradient_sum[j] = gradient_sum[j];
        }
        for (j = 0; j < num_theta; j++) {
            old_theta_sum[j] = theta_sum[j];
        }
        printf("\n Performing iteration number %d \n", (iteration+1));
        printf(" Beta value: %lg \n", beta);
        update_direction(gradient_sum, direction, beta, (num_weights-1));
        update_direction(theta_sum, theta_direction, beta, num_theta);
        alpha = calc_alpha(weight, direction, theta, theta_direction, activity,
                           data_array, array_size, num_node, num_weights,
                           num_theta);
        printf(" Alpha value: %lg \n", alpha);
        update_weights(weight, alpha, direction, (num_weights-1));
        update_weights(theta, alpha, theta_direction, num_theta);
    }

    /* Report the error sum for the final weights and thresholds. */
    errorsum = 0.0;
    array_ptr = data_array;
    for (set_num = 0; set_num < array_size; set_num++) {
        for (i = 0; i < num_node[0]; i++) {
            activity[i] = *array_ptr++;
        }
        desired_output = *array_ptr++;
        actual_output = fire_neurons(activity, weight, theta, num_node);
        error = actual_output - desired_output;
        error *= error;
        errorsum += error;
    }
    printf("\n Final error sum: %lg \n", errorsum);
    return;
}

/********************************************************************/
/* FUNCTION FIRE_NEURONS                                            */
/********************************************************************/
double fire_neurons(double *activity_ptr, double *weight_ptr,
                    double *theta_ptr, int num_node[])
{
    int layer_num, neuron_num, j;
    double temp, *input_ptr, *output_ptr;

    input_ptr = activity_ptr;
    output_ptr = activity_ptr + num_node[0];
    /* Feed input forward thru each layer of the network */
    for (layer_num = 0; layer_num < 3; layer_num++) {
        for (neuron_num = 0; neuron_num < num_node[layer_num+1]; neuron_num++) {
            temp = 0.0;
            for (j = 0; j < num_node[layer_num]; j++) {
                temp -= (*weight_ptr++) * (input_ptr[j]);
            }
            temp += *theta_ptr++;
            /* Sigmoid of (weighted sum - threshold) */
            *output_ptr++ = 1.0 / (1.0 + exp(temp));
        }
        input_ptr += num_node[layer_num];
    }
    /* Scale the single 3rd layer output by the output weight */
    temp = (*input_ptr) * (*weight_ptr);
    return (temp);
}

/********************************************************************/
/* FUNCTION CALC_GRADIENT                                           */
/********************************************************************/
void calc_gradient(double activity[], double weight[],
                   double theta_gradient[], double gradient[],
                   int num_node[], int num_weights, int num_theta)
{
    int layer_num, i, j, offset;
    double *weight_ptr, *gradient_ptr, *result_gradient_ptr;
    double *output_acty_ptr, *input_acty_ptr, temp, *theta_ptr;

    /* All pointers start at the output end and walk backward. */
    weight_ptr = &weight[num_weights-1];
    gradient_ptr = &gradient[num_weights-1];
    result_gradient_ptr = gradient_ptr - 1;
    output_acty_ptr = &activity[0] + (num_node[0]+num_node[1]+num_node[2]);
    input_acty_ptr = output_acty_ptr - 1;
    theta_ptr = &theta_gradient[num_theta-1];
    for (layer_num = 2; layer_num > -1; layer_num--) {
        for (j = 0; j < num_node[layer_num+1]; j++) {
            temp = 0.0;
            offset = 0;
            for (i = 0; i < num_node[layer_num+2]; i++) {
                temp += (*weight_ptr) * (*gradient_ptr);
                weight_ptr -= num_node[layer_num+1];
                gradient_ptr -= num_node[layer_num+1];
            }
            offset = (num_node[layer_num+2]*num_node[layer_num+1]) - 1;
            weight_ptr += offset;
            gradient_ptr += offset;
            temp *= (1.0 - (*output_acty_ptr--));
            for (i = 0; i < num_node[layer_num]; i++) {
                (*result_gradient_ptr--) = temp * (*input_acty_ptr--);
            }
            *theta_ptr-- = (-temp);
            input_acty_ptr += num_node[layer_num];
        }
        input_acty_ptr -= num_node[layer_num];
    }
    return;
}

/********************************************************************/
/* FUNCTION UPDATE_WEIGHTS                                          */
/********************************************************************/
void update_weights(double weight[], double alpha, double direction[],
                    int num_inputs)
{
    int i;

    /* Step each value a distance alpha along its search direction. */
    for (i = 0; i < num_inputs; i++) {
        weight[i] += alpha * direction[i];
    }
    return;
}

/********************************************************************/
/* FUNCTION CALC_BETA                                               */
/********************************************************************/
double calc_beta(double old_gradient[], double old_theta_gradient[],
                 double new_gradient[], double new_theta_gradient[],
                 int num_inputs, int num_theta)
{
    int i;
    double beta, temp1, temp2;

    /* temp1 = (g_new - g_old) . g_new,  temp2 = g_old . g_old,
       taken over both the weights and the thresholds. */
    temp1 = 0.0;
    temp2 = 0.0;
    for (i = 0; i < num_inputs; i++) {
        temp1 += ((new_gradient[i] - old_gradient[i]) * new_gradient[i]);
        temp2 += (old_gradient[i] * old_gradient[i]);
    }
    for (i = 0; i < num_theta; i++) {
        temp1 += ((new_theta_gradient[i] - old_theta_gradient[i]) *
                  new_theta_gradient[i]);
        temp2 += (old_theta_gradient[i] * old_theta_gradient[i]);
    }
    beta = temp1 / temp2;
    /* A negative beta restarts the search along the negative gradient. */
    if (beta < 0.0) {
        beta = 0.0;
    }
    return (beta);
}

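/********************************************************************/
/* FUNCTION UPDATE_DIRECTION                                        */
/********************************************************************/
void update_direction(double gradient[], double direction[],
                      double beta, int num_inputs)
{
    int i;

    /* New direction = beta * old direction - gradient */
    for (i = 0; i < num_inputs; i++) {
        direction[i] *= beta;
        direction[i] -= gradient[i];
    }
    return;
}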

/********************************************************************/
/* FUNCTION CALC_ALPHA                                              */
/********************************************************************/
double calc_alpha(double weight[], double direction[], double theta[],
                  double theta_direction[], double activity[],
                  double data_array[], int array_size, int num_node[],
                  int num_weights, int num_theta)
{
    double a, b, lamda, mu, lamda_result, mu_result, desired_result, epsilon;
    double actual_result, test_weight[500], test_theta[50], *array_ptr;
    int i, k, set_num, max_steps;

    /* Fibonacci line search for the step size along the direction. */
    a = 0.0;
    b = 10.0;
    max_steps = 16;
    epsilon = 0.001;
    lamda = a + ((b-a)*fibon(max_steps-2)/fibon(max_steps));
    mu = a + ((b-a)*fibon(max_steps-1)/fibon(max_steps));
    a -= lamda;
    b -= lamda;
    mu -= lamda;
    lamda = 0.0;

    /* Error sum at the trial step lamda */
    load_values(weight, test_weight, num_weights);
    load_values(theta, test_theta, num_theta);
    update_weights(test_weight, lamda, direction, (num_weights-1));
    update_weights(test_theta, lamda, theta_direction, num_theta);
    lamda_result = 0.0;
    array_ptr = data_array;
    for (set_num = 0; set_num < array_size; set_num++) {
        for (i = 0; i < num_node[0]; i++) {
            activity[i] = *array_ptr++;
        }
        desired_result = *array_ptr++;
        actual_result = fire_neurons(activity, test_weight, test_theta,
                                     num_node);
        actual_result -= desired_result;
        actual_result *= actual_result;
        lamda_result += actual_result;
    }

    /* Error sum at the trial step mu */
    load_values(weight, test_weight, num_weights);
    load_values(theta, test_theta, num_theta);
    update_weights(test_weight, mu, direction, (num_weights-1));
    update_weights(test_theta, mu, theta_direction, num_theta);
    mu_result = 0.0;
    array_ptr = data_array;
    for (set_num = 0; set_num < array_size; set_num++) {
        for (i = 0; i < num_node[0]; i++) {
            activity[i] = *array_ptr++;
        }
        desired_result = *array_ptr++;
        actual_result = fire_neurons(activity, test_weight, test_theta,
                                     num_node);
        actual_result -= desired_result;
        actual_result *= actual_result;
        mu_result += actual_result;
    }

    /* Shrink the interval, reusing one of the two trial points each step */
    for (k = 1; (k < (max_steps-1)) && (b > 0.0); k++) {
        if (lamda_result > mu_result) {
            a = lamda;
            lamda = mu;
            lamda_result = mu_result;
            mu = ((b-a)/fibon(max_steps-k));
            mu *= fibon(max_steps-k-1);
            mu += a;
            load_values(weight, test_weight, num_weights);
            load_values(theta, test_theta, num_theta);
            update_weights(test_weight, mu, direction, (num_weights-1));
            update_weights(test_theta, mu, theta_direction, num_theta);
            mu_result = 0.0;
            array_ptr = data_array;
            for (set_num = 0; set_num < array_size; set_num++) {
                for (i = 0; i < num_node[0]; i++) {
                    activity[i] = *array_ptr++;
                }
                desired_result = *array_ptr++;
                actual_result = fire_neurons(activity, test_weight, test_theta,
                                             num_node);
                actual_result -= desired_result;
                actual_result *= actual_result;
                mu_result += actual_result;
            }
        }
        else {
            b = mu;
            mu = lamda;
            mu_result = lamda_result;
            lamda = ((b-a)/fibon(max_steps-k));
            lamda *= fibon(max_steps-k-2);
            lamda += a;
            load_values(weight, test_weight, num_weights);
            load_values(theta, test_theta, num_theta);
            update_weights(test_weight, lamda, direction, (num_weights-1));
            update_weights(test_theta, lamda, theta_direction, num_theta);
            lamda_result = 0.0;
            array_ptr = data_array;
            for (set_num = 0; set_num < array_size; set_num++) {
                for (i = 0; i < num_node[0]; i++) {
                    activity[i] = *array_ptr++;
                }
                desired_result = *array_ptr++;
                actual_result = fire_neurons(activity, test_weight, test_theta,
                                             num_node);
                actual_result -= desired_result;
                actual_result *= actual_result;
                lamda_result += actual_result;
            }
        }
    }

    if (b > 0.0) {
        /* Final step: compare lamda with a point offset by epsilon */
        mu = lamda + epsilon;
        load_values(weight, test_weight, num_weights);
        load_values(theta, test_theta, num_theta);
        update_weights(test_weight, mu, direction, (num_weights-1));
        update_weights(test_theta, mu, theta_direction, num_theta);
        mu_result = 0.0;
        array_ptr = data_array;
        for (set_num = 0; set_num < array_size; set_num++) {
            for (i = 0; i < num_node[0]; i++) {
                activity[i] = *array_ptr++;
            }
            desired_result = *array_ptr++;
            actual_result = fire_neurons(activity, test_weight, test_theta,
                                         num_node);
            actual_result -= desired_result;
            actual_result *= actual_result;
            mu_result += actual_result;
        }
        if (lamda_result > mu_result) {
            if ((lamda+b) > 0.0) {
                return ((lamda+b)/2.0);
            }
            else {
                return (0.0);
            }
        }
        else {
            if ((lamda+a) > 0.0) {
                return ((lamda+a)/2.0);
            }
            else {
                return (0.0);
            }
        }
    }
    else {
        return (0.0);
    }
}

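/********************************************************************/
/* FUNCTION LOAD_VALUES                                             */
/********************************************************************/
void load_values(double *input_ptr, double *output_ptr, int total_num)
{
    int i;

    /* Copy total_num values from the input array to the output array. */
    for (i = 0; i < total_num; i++) {
        *output_ptr++ = *input_ptr++;
    }
    return;
}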

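/********************************************************************/
/* FUNCTION FIBON                                                   */
/********************************************************************/
int fibon(int n)
{
    int f0, f1, f2, k;

    /* Fibonacci numbers with fibon(0) = fibon(1) = 1 */
    f2 = f1 = f0 = 1;
    if (n < 2) {
        return (1);
    }
    for (k = 1; k < n; k++) {
        f0 = f1 + f2;
        f2 = f1;
        f1 = f0;
    }
    return (f0);
}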

/********************************************************************/
/* FUNCTION WRITE_RESULT                                            */
/********************************************************************/
void write_result(double weight[], double theta[], int num_node[],
                  double ts_data[], int set_size)
{
    FILE *fileptr;
    char fname[14];
    int i, set_num;
    double desired_result, result, activity[50], *array_ptr;

    printf("\n\n Where do you want the results stored? ");
    fflush(stdout);
    /* Read at most 13 characters so the 14-byte buffer cannot overflow. */
    scanf("%13s", fname);
    printf("\n ** Calculating final results ** \n");
    fileptr = fopen(&fname[0], "w");
    array_ptr = ts_data;
    /* Write one line per training set: desired output, network output. */
    for (set_num = 0; set_num < set_size; set_num++) {
        for (i = 0; i < num_node[0]; i++) {
            activity[i] = *array_ptr++;
        }
        desired_result = *array_ptr++;
        result = fire_neurons(activity, weight, theta, num_node);
        fprintf(fileptr, " %e  %e \n", desired_result, result);
    }
    fclose(fileptr);
    return;
}

/********************************************************************/
/* FUNCTION MAP_NETWORK                                             */
/********************************************************************/
void map_network(double weight[], double theta[], int num_node[])
{
    int row, col;
    double result, input1, input2, activity[50];
    FILE *fileptr;
    char fname[13];

    printf("\n\n Where do you want the map matrix stored? ");
    fflush(stdout);
    /* Read at most 12 characters so the 13-byte buffer cannot overflow. */
    scanf("%12s", fname);
    printf("\n ** Calculating map of network **\n");
    fileptr = fopen(&fname[0], "w");
    input1 = input2 = 0.0;
    /* Evaluate the network on a 21 x 21 grid with a step of 0.05. */
    for (row = 0; row < 21; row++) {
        for (col = 0; col < 21; col++) {
            activity[0] = input1;
            activity[1] = input2;
            result = fire_neurons(activity, weight, theta, num_node);
            fprintf(fileptr, " %e", result);
            input1 += 0.05;
        }
        fprintf(fileptr, "\n");
        input1 = 0.0;
        input2 += 0.05;
    }
    fclose(fileptr);
    return;
}

/********************************************************************/
/* FUNCTION STORE_WEIGHTS                                           */
/********************************************************************/
void store_weights(double weight[], double *theta_ptr, int num_node[])
{
    int i, j, k;
    double *weight_ptr1, *weight_ptr2;
    char fname[13];
    FILE *fileptr;

    printf("\n\n Where do you want the final weight/theta values stored? ");
    fflush(stdout);
    /* Read at most 12 characters so the 13-byte buffer cannot overflow. */
    scanf("%12s", fname);
    printf("\n ** Storing final weight/theta values **\n");
    fileptr = fopen(&fname[0], "w");
    /* First line: number of neurons in each layer */
    for (i = 0; i < 3; i++) {
        fprintf(fileptr, "%4d", num_node[i]);
    }
    fprintf(fileptr, "\n");
    weight_ptr2 = weight;
    for (i = 0; i < 3; i++) {
        weight_ptr1 = weight_ptr2;
        /* One row per input of the layer, one column per neuron */
        for (j = 0; j < num_node[i]; j++) {
            weight_ptr1 = weight_ptr2 + j;
            for (k = 0; k < num_node[i+1]; k++) {
                fprintf(fileptr, "%10.6lf ", *weight_ptr1);
                weight_ptr1 += num_node[i];
            }
            fprintf(fileptr, " \n");
        }
        weight_ptr2 += (num_node[i]*num_node[i+1]);
        /* Thresholds for the layer follow its weights */
        for (j = 0; j < num_node[i+1]; j++) {
            fprintf(fileptr, "%10.6lf ", *theta_ptr++);
        }
        fprintf(fileptr, " \n");
    }
    /* Last line: output weight */
    fprintf(fileptr, "%10.6lf \n", *weight_ptr2);
    fclose(fileptr);
    return;
}
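
The interval reduction performed by calc_alpha can be easier to follow when
separated from the network evaluation. The sketch below applies the same
Fibonacci ratios to a plain one-variable function. It is an illustration
under stated assumptions, not part of the thesis program: the names
fib_number, fibonacci_search and example_objective are hypothetical, the
search stops three steps short of n and returns the midpoint of the final
bracket, and the epsilon comparison used in calc_alpha is omitted.

```c
/* Fibonacci numbers with F(0) = F(1) = 1, as in fibon() above. */
static long fib_number(int n)
{
    long f_prev = 1, f_cur = 1, f_next;
    int k;

    for (k = 1; k < n; k++) {
        f_next = f_cur + f_prev;
        f_prev = f_cur;
        f_cur = f_next;
    }
    return f_cur;
}

/* Minimize a unimodal function f on [a, b] using n >= 3 Fibonacci steps.
   Each pass discards one end of the bracket and reuses the surviving
   interior point, so only one new function evaluation is needed. */
double fibonacci_search(double (*f)(double), double a, double b, int n)
{
    double lambda = a + (b - a) * fib_number(n - 2) / fib_number(n);
    double mu = a + (b - a) * fib_number(n - 1) / fib_number(n);
    double f_lambda = f(lambda);
    double f_mu = f(mu);
    int k;

    for (k = 1; k <= n - 3; k++) {
        if (f_lambda > f_mu) {
            /* Minimum lies in [lambda, b] */
            a = lambda;
            lambda = mu;
            f_lambda = f_mu;
            mu = a + (b - a) * fib_number(n - k - 1) / fib_number(n - k);
            f_mu = f(mu);
        } else {
            /* Minimum lies in [a, mu] */
            b = mu;
            mu = lambda;
            f_mu = f_lambda;
            lambda = a + (b - a) * fib_number(n - k - 2) / fib_number(n - k);
            f_lambda = f(lambda);
        }
    }
    return 0.5 * (a + b);
}

/* Hypothetical test objective with its minimum at x = 3. */
double example_objective(double x)
{
    return (x - 3.0) * (x - 3.0);
}
```

With the interval [0, 10] and 16 steps, the values used in calc_alpha, the
final bracket is a small fraction of the starting interval, so the returned
midpoint lies close to the minimizer.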